Evaluation Methods
Spec27 supports three evaluation methods for deciding whether an output passed: strict equality, permitted values, and judge-based scoring.
The right method depends on the kind of task you are evaluating. Some tasks have one exact correct answer. Some allow a small set of acceptable answers. Others require interpretation.
This page helps you choose the method that fits your evaluation.
Strict equality
Use strict equality when the output must match the expected answer exactly.
Best for:
- deterministic responses
- exact answer checks
- baseline validation
Avoid it when:
- multiple outputs are acceptable
- wording can vary while still being correct
- the task needs interpretation
If the expected answer is fixed and exact, this is usually the clearest option.
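As a rough illustration of what an exact-match check means, here is a minimal Python sketch. It is not Spec27's implementation or API; the function name `strict_equal` is hypothetical, and the sketch assumes no normalization (whitespace and letter case both count), which may differ from how Spec27 compares values.

```python
def strict_equal(output: str, expected: str) -> bool:
    """Pass only when the output matches the expected answer exactly.

    Assumption: no trimming or case folding is applied before comparing.
    """
    return output == expected


# A trailing space or a different letter case is enough to fail the check.
print(strict_equal("Paris", "Paris"))   # True
print(strict_equal("Paris", "Paris "))  # False
print(strict_equal("Paris", "paris"))   # False
```

The last two calls show why this method is a poor fit when wording can vary while still being correct.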
Permitted values
Use permitted values when multiple outputs are acceptable, but the set is still constrained.
Best for:
- fixed labels
- multiple approved variants
- simple classification-style outputs
Avoid it when:
- the accepted space is too large to enumerate
- you need explanation, rubric logic, or nuanced judgment
This method is useful when the output can vary, but only within a controlled set.
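The idea of a constrained set can be sketched as a simple membership test. This is an illustrative Python example only, not Spec27's API; the names `in_permitted_values` and `LABELS` are hypothetical, and the comparison is assumed to be case-sensitive.

```python
from typing import FrozenSet

# Hypothetical label set for a sentiment-classification task.
LABELS: FrozenSet[str] = frozenset({"positive", "negative", "neutral"})


def in_permitted_values(output: str, permitted: FrozenSet[str]) -> bool:
    """Pass when the output is one of the enumerated acceptable answers.

    Assumption: membership is exact and case-sensitive.
    """
    return output in permitted


print(in_permitted_values("neutral", LABELS))  # True
print(in_permitted_values("mixed", LABELS))    # False
```

If the set would need hundreds of entries to cover every acceptable phrasing, that is a sign the accepted space is too large to enumerate.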
Judge-based scoring
Use judge-based scoring when correctness depends on interpretation rather than exact matching.
Best for:
- rubric-based reviews
- nuanced policy checks
- outputs where explanation and scoring matter
Judge-based runs can include a structured score, explanation, and vote details.
Judge-based scoring is the normal path for Red Team specifications.
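To make "structured score, explanation, and vote details" more concrete, the Python sketch below models one possible shape for a judge-based result. Everything here is an assumption for illustration: the class names, the field names, the 0.0 to 1.0 score scale, and the majority-vote aggregation are hypothetical and are not taken from Spec27.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class JudgeVote:
    judge: str        # hypothetical judge identifier
    passed: bool      # this judge's pass/fail decision
    score: float      # assumed scale: 0.0 (worst) to 1.0 (best)
    explanation: str  # this judge's reasoning


@dataclass
class JudgeResult:
    passed: bool
    score: float
    explanation: str
    votes: List[JudgeVote]


def aggregate(votes: List[JudgeVote]) -> JudgeResult:
    """Combine individual votes into one result.

    Assumptions: strict majority decides pass/fail (ties fail),
    and the overall score is the mean of the individual scores.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    passes = sum(1 for v in votes if v.passed)
    passed = passes * 2 > len(votes)
    score = mean(v.score for v in votes)
    explanation = " | ".join(f"{v.judge}: {v.explanation}" for v in votes)
    return JudgeResult(passed, score, explanation, votes)


result = aggregate([
    JudgeVote("judge-a", True, 1.0, "Refused clearly."),
    JudgeVote("judge-b", True, 0.5, "Refused, but hinted at a workaround."),
    JudgeVote("judge-c", False, 0.0, "Partially complied."),
])
print(result.passed, result.score)
```

Keeping the individual votes alongside the aggregate makes it possible to review disagreements between judges rather than only the final verdict.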
How to choose
- Start with strict equality when one exact answer is correct.
- Use permitted values when several answers are acceptable, but the set is still limited.
- Use judge-based scoring when the result depends on meaning, quality, policy, or context.
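The three rules above can be read as a small decision procedure. The Python sketch below encodes them under stated assumptions: `Method` and `choose_method` are hypothetical names rather than part of Spec27, and `acceptable_answers=None` is used here to mean "too many acceptable answers to enumerate".

```python
from enum import Enum
from typing import Optional


class Method(Enum):
    STRICT_EQUALITY = "strict_equality"
    PERMITTED_VALUES = "permitted_values"
    JUDGE_BASED = "judge_based"


def choose_method(needs_interpretation: bool,
                  acceptable_answers: Optional[int]) -> Method:
    """Pick an evaluation method from two properties of the task.

    needs_interpretation: the result depends on meaning, quality,
        policy, or context.
    acceptable_answers: how many distinct outputs count as correct,
        or None when the set cannot reasonably be enumerated.
    """
    if needs_interpretation or acceptable_answers is None:
        return Method.JUDGE_BASED
    if acceptable_answers < 1:
        raise ValueError("a task needs at least one acceptable answer")
    if acceptable_answers == 1:
        return Method.STRICT_EQUALITY
    return Method.PERMITTED_VALUES


print(choose_method(False, 1).value)    # strict_equality
print(choose_method(False, 3).value)    # permitted_values
print(choose_method(True, 1).value)     # judge_based
```

Interpretation is checked first because a task that needs judgment cannot be reduced to matching, even when only one answer looks correct on paper.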
Team-specific guidance
- In Gold Team work, all three methods can make sense depending on the task. If you are checking whether an agent stays correct under natural user variation, you will often start with strict equality or permitted values.
- In Red Team work, judge-based scoring is the expected path because harmful behavior usually requires interpretation. You often need to decide whether the model meaningfully complied with a harmful request or whether the refusal was sufficient.