Specification-Driven Repeatability

Spec27 uses specifications to turn evaluation intent into a reusable, machine-readable setup that can be rerun as agents, prompts, and models change.

This is the main concept behind repeatability in the product. Instead of manually recreating a test every time, you define the standard once and reuse it.

What a specification is

A specification is the reusable evaluation recipe.

It defines what should be tested and how it should be evaluated. Depending on the workflow, a specification can include:

your initial entries
the evaluation method, including judge-based scoring when needed
attack methods or adversarial selections
the execution mode that determines how the evaluation runs

A specification is not the run itself. It is the saved definition that an eval can reuse over time.
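To make the distinction between a saved definition and a run concrete, here is a minimal sketch of how such a definition could be modeled. The class, field names, and default values are hypothetical and chosen for illustration; they are not Spec27's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Specification:
    """Illustrative shape only; every field name here is an assumption."""

    name: str
    entries: tuple[str, ...]               # the initial entries to test
    evaluation_method: str                 # e.g. "exact_match", "permitted_values", "judge"
    attack_methods: tuple[str, ...] = ()   # adversarial selections, if any
    execution_mode: str = "single_turn"    # assumed label for how the evaluation runs

    def to_json(self) -> str:
        # A stable, machine-readable form that can be stored and reused later.
        return json.dumps(asdict(self), sort_keys=True)


spec = Specification(
    name="refund-policy",
    entries=("What is the refund window?",),
    evaluation_method="exact_match",
)
saved = spec.to_json()
```

The definition is frozen and holds no results: it only describes what to test and how, which is what lets an eval reuse it over time.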

Why specifications create repeatability

Without specifications, testing often depends on who ran the prompts, which examples they remembered to try, and how they interpreted the outputs that day.

With specifications, the setup is explicit. That makes it easier to answer questions like:

did the agent improve after the latest change?
did a prompt update introduce regressions?
are two agents being tested against the same standard?
can the team rerun the same workflow next week or next month?

What repeatability looks like in practice

In practice, teams use specifications to separate evaluation setup from execution:

define the reusable specification once
attach it to one or more evals
run those evals when needed
compare results across repeated runs

That separation is what makes the workflow repeatable. The same specification can be reused even when the agent changes.
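The separation described above can be sketched in a few lines. This is an illustration under stated assumptions, not Spec27's API: `Specification`, `Eval`, and the two lookup-table agents are hypothetical stand-ins, and scoring is simplified to exact string equality.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Specification:
    """Hypothetical setup: a name plus (input, expected output) pairs."""

    name: str
    cases: tuple[tuple[str, str], ...]


@dataclass
class Eval:
    """Hypothetical execution: binds one agent to one specification."""

    spec: Specification
    agent: Callable[[str], str]
    history: list[float] = field(default_factory=list)

    def run(self) -> float:
        # Score is the fraction of cases where the agent's output equals the expected output.
        passed = sum(self.agent(question) == expected for question, expected in self.spec.cases)
        score = passed / len(self.spec.cases)
        self.history.append(score)
        return score


# Define the specification once.
spec = Specification("arithmetic", (("2+2", "4"), ("3*3", "9")))

# Two stand-in agents backed by lookup tables; the first gets one case wrong.
answers_v1 = {"2+2": "4", "3*3": "6"}
answers_v2 = {"2+2": "4", "3*3": "9"}

# Attach the same specification to more than one eval, then run each.
eval_v1 = Eval(spec, lambda question: answers_v1.get(question, ""))
eval_v2 = Eval(spec, lambda question: answers_v2.get(question, ""))
score_v1 = eval_v1.run()
score_v2 = eval_v2.run()
```

Because both evals point at the same `spec` object, the difference between the two scores reflects the agents rather than a change in the test.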

Precision also depends on scoring

Repeatability becomes much more useful when the scoring method is also explicit.

Spec27 supports several evaluation methods so the specification can match the task:

exact matching when there is one correct output
permitted values when multiple outputs are acceptable
judge-based scoring when correctness requires interpretation

This makes the specification both reusable and precise enough to support comparison over time.
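As a rough illustration of how the three methods differ, the sketch below implements each as a small function. The function names, the whitespace trimming, the pass threshold, and the keyword-based stand-in judge are all assumptions made for this example; a real judge would typically be a model call, and none of this is Spec27's implementation.

```python
from typing import Callable, Iterable


def exact_match(output: str, expected: str) -> bool:
    # One correct output: compare after trimming surrounding whitespace (an assumption).
    return output.strip() == expected.strip()


def permitted_values(output: str, allowed: Iterable[str]) -> bool:
    # Several acceptable outputs: pass if the output is any one of them.
    return output.strip() in {value.strip() for value in allowed}


def judge_based(
    output: str,
    rubric: str,
    judge: Callable[[str, str], float],
    threshold: float = 0.5,
) -> bool:
    # Interpretation needed: a judge returns a score in [0, 1]; pass at or above the threshold.
    return judge(output, rubric) >= threshold


def keyword_judge(output: str, rubric: str) -> float:
    # Toy stand-in for a judge: fraction of rubric keywords that appear in the output.
    keywords = rubric.lower().split()
    if not keywords:
        return 0.0
    hits = sum(keyword in output.lower() for keyword in keywords)
    return hits / len(keywords)
```

Keeping the method explicit in the specification means two runs are scored the same way, so a changed score points to changed behavior.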

Example: reusing the same specification across versions

Suppose a team updates an agent's prompt and wants to know whether behavior improved or regressed.

Instead of repeating a fresh round of manual checks, they can:

keep the existing specification
attach the updated agent to the eval
rerun the workflow
review the results against earlier runs

That gives the team stronger evidence than informal spot checks because the standard has stayed the same while the agent changed.
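Reviewing results against earlier runs can be as simple as a per-case diff. The sketch below assumes each run is recorded as a mapping from a case identifier to a pass/fail flag; that record format, the case names, and the `compare_runs` helper are hypothetical.

```python
def compare_runs(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
) -> dict[str, list[str]]:
    """Classify cases present in both runs by how their pass/fail status changed."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "regressed": [case for case in shared if baseline[case] and not candidate[case]],
        "improved": [case for case in shared if not baseline[case] and candidate[case]],
        "unchanged": [case for case in shared if baseline[case] == candidate[case]],
    }


# Same specification, two agent versions (illustrative data).
before_prompt_update = {"greeting": True, "refund-window": False, "escalation": True}
after_prompt_update = {"greeting": True, "refund-window": True, "escalation": False}

report = compare_runs(before_prompt_update, after_prompt_update)
```

The comparison is only meaningful because both runs cover the same cases under the same scoring, which is what keeping the existing specification guarantees.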
