Evaluation Methods

Spec27 supports three evaluation methods for deciding whether an output passed.

The right method depends on the kind of task you are evaluating. Some tasks have one exact correct answer. Some allow a small set of acceptable answers. Others require interpretation.

This page helps you choose the method that fits your evaluation.

Strict equality

Use strict equality when the output must match the expected answer exactly.

Best for:

deterministic responses
exact answer checks
baseline validation

Avoid it when:

multiple outputs are acceptable
wording can vary while still being correct
the task needs interpretation

If the expected answer is fixed and exact, this is usually the clearest option.
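As a minimal sketch of what an exact-match check amounts to, the function below compares an output to the expected answer with no normalization. The function name is hypothetical, not part of Spec27, and this page does not say whether Spec27 trims whitespace or ignores case, so the sketch assumes it does neither.

```python
def strict_equality_passed(output: str, expected: str) -> bool:
    """Hypothetical helper: pass only on an exact string match.

    Assumption: no trimming or case folding is applied before comparing.
    """
    return output == expected


if __name__ == "__main__":
    print(strict_equality_passed("Paris", "Paris"))   # identical strings
    print(strict_equality_passed("paris", "Paris"))   # differs by case
    print(strict_equality_passed("Paris ", "Paris"))  # trailing space
```

Under these assumptions, a difference in case or a stray trailing space is enough to fail the check, which is why this method suits deterministic responses.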

Permitted values

Use permitted values when multiple outputs are acceptable, but the set is still constrained.

Best for:

fixed labels
multiple approved variants
simple classification-style outputs

Avoid it when:

the accepted space is too large to enumerate
you need explanation, rubric logic, or nuanced judgment

This method is useful when the output can vary, but only within a controlled set.
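A permitted-values check can be pictured as a membership test against an enumerated set. The sketch below is illustrative only: the function name and the example labels are assumptions, not Spec27 identifiers, and each permitted value is assumed to be matched exactly.

```python
from typing import Iterable


def permitted_values_passed(output: str, permitted: Iterable[str]) -> bool:
    """Hypothetical helper: pass when the output equals any permitted value.

    Assumption: each permitted value is compared exactly, as in strict equality.
    """
    return output in set(permitted)


if __name__ == "__main__":
    labels = ["positive", "negative", "neutral"]  # example label set, not from Spec27
    print(permitted_values_passed("neutral", labels))
    print(permitted_values_passed("mixed", labels))
```

If the list of acceptable answers cannot reasonably be written out like `labels` above, that is a sign the task belongs with judge-based scoring instead.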

Judge-based scoring

Use judge-based scoring when correctness depends on interpretation rather than exact matching.

Best for:

rubric-based reviews
nuanced policy checks
outputs where explanation and scoring matter

Judge-based runs can include a structured score, explanation, and vote details.

Judge-based scoring is the normal path for Red Team specifications.
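To make the idea of a structured score, explanation, and vote details more concrete, here is one possible shape for a judge-based result. Everything in it is an assumption made for illustration: the class names, the field names, the 0.0 to 1.0 score, and the strict-majority rule are not taken from Spec27.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgeVote:
    judge: str        # hypothetical identifier for the judge that voted
    passed: bool      # this judge's pass/fail decision
    explanation: str  # this judge's stated reasoning


@dataclass
class JudgeResult:
    score: float            # assumed scale: fraction of judges voting "pass"
    passed: bool            # assumed rule: strict majority of pass votes
    explanation: str        # reasoning from the judges on the winning side
    votes: List[JudgeVote]  # per-judge vote details


def aggregate_votes(votes: List[JudgeVote]) -> JudgeResult:
    """Combine individual judge votes into one result (illustrative rule only)."""
    if not votes:
        raise ValueError("at least one vote is required")
    pass_count = sum(1 for v in votes if v.passed)
    score = pass_count / len(votes)
    passed = pass_count * 2 > len(votes)  # a tie counts as a fail
    explanation = " | ".join(v.explanation for v in votes if v.passed == passed)
    return JudgeResult(score=score, passed=passed, explanation=explanation, votes=votes)
```

Keeping the individual votes alongside the aggregate makes it possible to review why a run passed or failed, which matters when the decision rests on interpretation.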

How to choose

Start with strict equality when one exact answer is correct.
Use permitted values when several answers are acceptable, but the set is still limited.
Use judge-based scoring when the result depends on meaning, quality, policy, or context.
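The three rules above can be sketched as a small decision function. The function and its parameters are hypothetical and exist only to show the order of the questions; Spec27 does not necessarily expose anything like it.

```python
from typing import Optional


def choose_method(acceptable_answers: Optional[int], needs_interpretation: bool) -> str:
    """Hypothetical helper that mirrors the guidance on this page.

    acceptable_answers: how many distinct outputs are correct, or None when
    the set is too large to enumerate.
    needs_interpretation: True when the result depends on meaning, quality,
    policy, or context.
    """
    if acceptable_answers is not None and acceptable_answers < 1:
        raise ValueError("acceptable_answers must be at least 1 or None")
    if needs_interpretation or acceptable_answers is None:
        return "judge-based scoring"
    if acceptable_answers == 1:
        return "strict equality"
    return "permitted values"


if __name__ == "__main__":
    print(choose_method(1, False))     # one exact answer
    print(choose_method(3, False))     # a small, fixed set of answers
    print(choose_method(None, False))  # too many answers to list
    print(choose_method(1, True))      # interpretation required
```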

Team-specific guidance

In Gold Team work, all three methods can make sense depending on the task. If you are checking whether an agent stays correct under natural user variation, you will often start with strict equality or permitted values.
In Red Team work, judge-based evaluation is the expected path because harmful behavior usually requires interpretation. You often need to decide whether the model meaningfully complied with a harmful request or whether the refusal was sufficient.
