Specification-Driven Repeatability
Spec27 uses specifications to turn evaluation intent into a reusable, machine-readable setup that can be rerun as agents, prompts, and models change.
This is the main concept behind repeatability in the product. Instead of manually recreating a test every time, you define the standard once and reuse it.
What a specification is
A specification is the reusable evaluation recipe.
It defines what should be tested and how it should be evaluated. Depending on the workflow, a specification can include:
- the primary dataset
- the evaluation method
- any required judges
- attack methods or adversarial selections
- the execution mode that determines how the evaluation runs
A specification is not the run itself. It is the saved definition that an eval can reuse over time.
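As a rough illustration, the parts listed above could be modeled as a small immutable record. This is a minimal sketch, not Spec27's actual schema: the class name, field names, and example values are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, FrozenInstanceError


# Hypothetical model of a specification. Every name and value here is
# illustrative and is not taken from Spec27's real schema.
@dataclass(frozen=True)
class Specification:
    name: str
    dataset: str                          # the primary dataset
    evaluation_method: str                # e.g. "exact_match", "permitted_values", "judge"
    judges: tuple[str, ...] = ()          # any required judges
    attack_methods: tuple[str, ...] = ()  # attack methods or adversarial selections
    execution_mode: str = "default"       # assumed placeholder value


spec = Specification(
    name="refund-policy-checks",
    dataset="refund-questions-v1",
    evaluation_method="judge",
    judges=("policy-accuracy",),
)

# frozen=True is a modeling choice in this sketch: the definition is
# fixed, and each run only reads from it.
try:
    spec.dataset = "other-dataset"
except FrozenInstanceError:
    print("the saved definition cannot be changed by a run")
```

Keeping the record immutable mirrors the distinction above: the specification is the saved definition, and a run is something that happens against it.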
Why specifications create repeatability
Without specifications, testing often depends on who ran the prompts, which examples they remembered to try, and how they interpreted the outputs that day.
With specifications, the setup is explicit. That makes it easier to answer questions like:
- did the agent improve after the latest change?
- did a prompt update introduce regressions?
- are two agents being tested against the same standard?
- can the team rerun the same workflow next week or next month?
What repeatability looks like in practice
In practice, teams use specifications to separate evaluation setup from execution:
- define the reusable specification once
- attach it to one or more evals
- run those evals when needed
- compare results across repeated runs
That separation is what makes the workflow repeatable. The same specification can be reused even when the agent changes.
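The steps above can be sketched in a few lines. This is a simplified model under stated assumptions: `Specification`, `Eval`, and the two toy agents are hypothetical names, and scoring is reduced to exact string comparison to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Specification:
    """Hypothetical reusable setup: a name plus (input, expected) cases."""
    name: str
    cases: tuple[tuple[str, str], ...]


@dataclass
class Eval:
    """Hypothetical eval: attaches one agent to one specification."""
    spec: Specification
    agent: Callable[[str], str]
    scores: list[float] = field(default_factory=list)

    def run(self) -> float:
        # Execute every case in the specification and record the pass rate.
        passed = sum(self.agent(q) == expected for q, expected in self.spec.cases)
        score = passed / len(self.spec.cases)
        self.scores.append(score)
        return score


# 1. Define the reusable specification once.
spec = Specification("arithmetic", (("2+2", "4"), ("3*3", "9")))


# Two toy agents standing in for two versions of a real agent.
def agent_v1(question: str) -> str:
    return "4"


def agent_v2(question: str) -> str:
    return {"2+2": "4", "3*3": "9"}.get(question, "")


# 2. Attach the same specification to one or more evals.
eval_v1 = Eval(spec, agent_v1)
eval_v2 = Eval(spec, agent_v2)

# 3. Run when needed, then 4. compare results across runs.
print(eval_v1.run(), eval_v2.run())
```

Because both evals point at the same `spec` object, any difference between the two scores comes from the agents rather than from the setup.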
Precision also depends on scoring
Rerunning the same setup only supports comparison when results are scored the same way each time, so the scoring method needs to be explicit as well.
Spec27 supports several evaluation methods so the specification can match the task:
- exact matching when there is one correct output
- permitted values when multiple outputs are acceptable
- judge-based scoring when correctness requires interpretation
This makes the specification both reusable and precise enough to support comparison over time.
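To make the three methods concrete, here is a minimal sketch of each as a plain function. The function names, the whitespace normalization, the score threshold, and the keyword-based judge are all assumptions for illustration; they do not describe how Spec27 implements scoring.

```python
from typing import Callable, Iterable


def exact_match(output: str, expected: str) -> bool:
    """Pass only when the output equals the single correct answer.

    Stripping surrounding whitespace is an assumption of this sketch.
    """
    return output.strip() == expected.strip()


def permitted_values(output: str, allowed: Iterable[str]) -> bool:
    """Pass when the output is any one of several acceptable answers."""
    return output.strip() in {value.strip() for value in allowed}


def judge_based(output: str, judge: Callable[[str], float], threshold: float = 0.5) -> bool:
    """Pass when a judge's score for the output meets the threshold.

    `judge` is a stand-in for whatever judge the specification names.
    """
    return judge(output) >= threshold


def keyword_judge(output: str) -> float:
    # Toy judge used only to make the example runnable.
    return 1.0 if "refund" in output.lower() else 0.0


print(exact_match(" 42 ", "42"))
print(permitted_values("y", ["yes", "y"]))
print(judge_based("We will issue a refund.", keyword_judge))
```

Each function returns a plain pass or fail, so results from different runs of the same specification can be compared directly regardless of which method produced them.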
Example: reusing the same specification across versions
Suppose a team updates an agent's prompt and wants to know whether behavior improved or regressed.
Instead of repeating a fresh round of manual checks, they can:
- keep the existing specification
- attach the updated agent to the eval
- rerun the workflow
- review the results against earlier runs
That gives the team stronger evidence than informal spot checks: the standard stayed the same while the agent changed, so differences between runs are more likely to reflect the change itself.
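Reviewing results against earlier runs can be as simple as a per-case diff. The sketch below assumes each run is available as a mapping from case ID to pass or fail; that shape, the function name `compare_runs`, and the sample data are hypothetical.

```python
from __future__ import annotations


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Group the cases shared by two runs by how their outcome changed."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        "improved": [c for c in shared if not baseline[c] and candidate[c]],
        "unchanged": [c for c in shared if baseline[c] == candidate[c]],
    }


# Made-up results for the same specification, before and after a prompt update.
before = {"case-1": True, "case-2": False, "case-3": True}
after = {"case-1": True, "case-2": True, "case-3": False}

diff = compare_runs(before, after)
print(diff)
```

A diff like this is only meaningful because both runs used the same specification. If the cases or scoring had changed between runs, a flipped result could not be traced to the agent.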