Evaluating Prompt Quality
"It works" is not a quality assessment. Professional prompt engineering requires structured evaluation across multiple dimensions, just as software engineering requires more than "the code runs."
The Five Dimensions
1. Correctness
Does the output contain accurate information and follow instructions? This is the minimum bar. A prompt that produces eloquent but wrong answers has failed.
2. Consistency
Does the prompt produce similar quality across multiple runs? A prompt that works brilliantly once and fails twice is unreliable. Consistency matters more than peak performance for production systems.
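Consistency can be made concrete by scoring each run and looking at the spread, not just the average. A minimal sketch, assuming you already have per-run quality scores from whatever grader or rubric you use (the numbers below are hypothetical):

```python
import statistics

def consistency_report(scores: list[float]) -> dict:
    """Summarize per-run quality scores for one prompt.

    A high mean paired with a high standard deviation is the
    'works brilliantly once, fails twice' pattern: good peak
    performance, poor reliability.
    """
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "worst": min(scores),
    }

# Three runs of the same prompt on the same input (hypothetical scores):
report = consistency_report([4.8, 2.1, 2.4])
```

For production decisions, the `worst` and `stdev` fields usually matter more than `mean`: they tell you what users will hit on a bad run.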
3. Efficiency
Does the prompt achieve its goal with reasonable token usage? An effective prompt that consumes ten times the necessary tokens is poorly engineered. Efficiency includes both input and output token counts.
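One way to make "reasonable token usage" measurable is to compare a prompt's total tokens against a leaner prompt that achieves the same goal. A small sketch with hypothetical token counts (real counts would come from your provider's usage metadata or a tokenizer):

```python
# Hypothetical usage for two prompts that produce equivalent results.
verbose = {"input": 2400, "output": 900}
concise = {"input": 350, "output": 400}

def total_tokens(usage: dict) -> int:
    # Efficiency covers both sides of the exchange:
    # input tokens you send and output tokens the model returns.
    return usage["input"] + usage["output"]

ratio = total_tokens(verbose) / total_tokens(concise)
```

A ratio well above 1.0, as here, is the "ten times the necessary tokens" smell: the verbose prompt works, but it is poorly engineered.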
4. Safety
Does the prompt resist misuse, avoid harmful outputs, and handle edge cases gracefully? A prompt that works perfectly on expected inputs but fails dangerously on unexpected ones is a liability.
5. Maintainability
Can the prompt be understood, modified, and extended by someone other than its author? Prompts that rely on obscure tricks or undocumented model quirks are technical debt waiting to happen.
Evaluation in Practice
For each dimension, define what "good" looks like for your use case. Not every prompt needs to score highly on every dimension. A one-off research query prioritizes correctness and can ignore maintainability. A production customer service prompt must score highly on all five.
Create a simple scorecard:
- Run the prompt ten times with varied inputs
- Rate each dimension on a 1-5 scale
- Identify the weakest dimension
- Improve the prompt targeting that specific weakness
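The scorecard loop above can be sketched in a few lines. The dimension names come from this section; the ratings are hypothetical stand-ins for what you would record across ten runs with varied inputs:

```python
def weakest_dimension(scorecard: dict[str, list[int]]) -> str:
    """Given 1-5 ratings per dimension across multiple runs,
    return the dimension with the lowest average rating —
    the one to target in the next revision."""
    averages = {dim: sum(ratings) / len(ratings)
                for dim, ratings in scorecard.items()}
    return min(averages, key=averages.get)

# Hypothetical ratings from ten runs with varied inputs:
scorecard = {
    "correctness":     [5, 4, 5, 5, 4, 5, 5, 4, 5, 5],
    "consistency":     [3, 2, 4, 3, 2, 3, 4, 2, 3, 3],
    "efficiency":      [4, 4, 4, 5, 4, 4, 4, 4, 4, 4],
    "safety":          [5, 5, 5, 5, 4, 5, 5, 5, 5, 5],
    "maintainability": [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
}
weakest = weakest_dimension(scorecard)
```

Fixing one dimension at a time keeps each revision attributable: if the next ten runs improve, you know which change caused it.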
The Trap of Anecdotal Testing
The most common evaluation mistake is testing a prompt once, seeing a good result, and declaring it done. Single-run evaluation tells you almost nothing about prompt quality. Models are stochastic — the same prompt can produce different outputs on different runs. Always test with multiple inputs and multiple runs before judging a prompt's quality.