Evaluating Prompt Quality
"It works" is not a quality assessment. Professional prompt engineering requires structured evaluation across multiple dimensions, just as software engineering requires more than "the code runs."
The Five Dimensions
1. Correctness
Does the output contain accurate information and follow instructions? This is the minimum bar. A prompt that produces eloquent but wrong answers has failed.
2. Consistency
Does the prompt produce similar quality across multiple runs? A prompt that works brilliantly once and fails twice is unreliable. Consistency matters more than peak performance for production systems.
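Consistency can be made concrete by scoring each run and looking at the spread, not just the average. A minimal sketch, assuming you already have per-run quality scores from whatever grader or rubric you use (the numbers below are hypothetical):

```python
import statistics

def consistency_report(scores: list[float]) -> dict:
    """Summarize per-run quality scores for one prompt.

    A high mean paired with a high standard deviation is the
    'works brilliantly once, fails twice' pattern: good peak
    performance, poor reliability.
    """
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "worst": min(scores),
    }

# Three runs of the same prompt on the same input (hypothetical scores):
report = consistency_report([4.8, 2.1, 2.4])
```

For production decisions, the `worst` and `stdev` fields usually matter more than `mean`: they tell you what users will hit on a bad run.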
3. Efficiency
Does the prompt achieve its goal with reasonable token usage? An effective prompt that consumes ten times the necessary tokens is poorly engineered. Efficiency includes both input and output token counts.
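One way to make "reasonable token usage" measurable is to compare a prompt's total tokens against a leaner prompt that achieves the same goal. A small sketch with hypothetical token counts (real counts would come from your provider's usage metadata or a tokenizer):

```python
# Hypothetical usage for two prompts that produce equivalent results.
verbose = {"input": 2400, "output": 900}
concise = {"input": 350, "output": 400}

def total_tokens(usage: dict) -> int:
    # Efficiency covers both sides of the exchange:
    # input tokens you send and output tokens the model returns.
    return usage["input"] + usage["output"]

ratio = total_tokens(verbose) / total_tokens(concise)
```

A ratio well above 1.0, as here, is the "ten times the necessary tokens" smell: the verbose prompt works, but it is poorly engineered.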
4. Safety
Does the prompt resist misuse, avoid harmful outputs, and handle edge cases gracefully? A prompt that works perfectly on expected inputs but fails dangerously on unexpected ones is a liability.
5. Maintainability
Can the prompt be understood, modified, and extended by someone other than its author? Prompts that rely on obscure tricks or undocumented model quirks are technical debt waiting to happen.
Evaluation in Practice
For each dimension, define what "good" looks like for your use case. Not every prompt needs to score highly on every dimension. A one-off research query prioritizes correctness and can ignore maintainability. A production customer service prompt must score highly on all five.
Create a simple scorecard:
- Run the prompt ten times with varied inputs
- Rate each dimension on a 1-5 scale
- Identify the weakest dimension
- Improve the prompt targeting that specific weakness
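The scorecard loop above can be sketched in a few lines. The dimension names come from this section; the ratings are hypothetical stand-ins for what you would record across ten runs with varied inputs:

```python
def weakest_dimension(scorecard: dict[str, list[int]]) -> str:
    """Given 1-5 ratings per dimension across multiple runs,
    return the dimension with the lowest average rating —
    the one to target in the next revision."""
    averages = {dim: sum(ratings) / len(ratings)
                for dim, ratings in scorecard.items()}
    return min(averages, key=averages.get)

# Hypothetical ratings from ten runs with varied inputs:
scorecard = {
    "correctness":     [5, 4, 5, 5, 4, 5, 5, 4, 5, 5],
    "consistency":     [3, 2, 4, 3, 2, 3, 4, 2, 3, 3],
    "efficiency":      [4, 4, 4, 5, 4, 4, 4, 4, 4, 4],
    "safety":          [5, 5, 5, 5, 4, 5, 5, 5, 5, 5],
    "maintainability": [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
}
weakest = weakest_dimension(scorecard)
```

Fixing one dimension at a time keeps each revision attributable: if the next ten runs improve, you know which change caused it.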
The Trap of Anecdotal Testing
The most common evaluation mistake is testing a prompt once, seeing a good result, and declaring it done. Single-run evaluation tells you almost nothing about prompt quality. Models are stochastic — the same prompt can produce different outputs on different runs. Always test with multiple inputs and multiple runs before judging a prompt's quality.