Gold Team Testing

Gold Team evaluations test whether an agent continues to behave correctly for normal, legitimate user requests.

In Spec27, Gold Team work focuses on desirable behavior, correctness, and robustness. The goal is not to break the system but to check whether the agent remains reliable when real users ask valid questions in different ways.

This is important because production traffic rarely looks identical to a clean demo prompt. Users paraphrase, make typing mistakes, add extra context, and phrase the same request in many different ways. A Gold Team evaluation helps you measure whether the agent still reaches the expected outcome under those natural variations.

What a Gold Team evaluation does

A Gold Team evaluation usually combines:

a primary set of normal test cases you care about
an agent you want to evaluate
an expected output for each test case
an evaluation method that decides whether the output passed
optional perturbations that make the input harder without changing the legitimate user intent

At run time, Spec27 typically does the following:

takes a user input from your primary test cases
optionally applies one or more perturbations to that input
sends the original or perturbed input to the agent
compares the output with the expected output
records which test cases passed and which failed, including those that passed on the original input but failed once perturbed
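
The run-time steps above can be sketched in a few lines of Python. This is a minimal illustration, not Spec27's API: `GoldCase`, `Result`, `run_gold_eval`, `fragile_cases`, and the toy agent are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class GoldCase:
    user_input: str
    expected: str


@dataclass(frozen=True)
class Result:
    case: GoldCase
    perturbation: Optional[str]  # None means the original input was sent
    output: str
    passed: bool


def run_gold_eval(
    cases: List[GoldCase],
    agent: Callable[[str], str],
    perturbations: Dict[str, Callable[[str], str]],
    passes: Callable[[str, str], bool],
) -> List[Result]:
    """Send the original and each perturbed input to the agent and score it."""
    results = []
    for case in cases:
        variants = [(None, case.user_input)]
        variants += [(name, fn(case.user_input)) for name, fn in perturbations.items()]
        for name, text in variants:
            output = agent(text)
            results.append(Result(case, name, output, passes(output, case.expected)))
    return results


def fragile_cases(results: List[Result]) -> List[str]:
    """Inputs that pass unperturbed but fail under at least one perturbation."""
    baseline_ok = {r.case.user_input for r in results
                   if r.perturbation is None and r.passed}
    broken = {r.case.user_input for r in results
              if r.perturbation is not None and not r.passed}
    return sorted(baseline_ok & broken)


# Hypothetical agent that only recognises the lowercase word "refund".
def toy_agent(text: str) -> str:
    return "refund_policy" if "refund" in text else "unknown"


cases = [GoldCase("How do I get a refund?", "refund_policy")]
results = run_gold_eval(
    cases,
    toy_agent,
    perturbations={"uppercase": str.upper},
    passes=lambda output, expected: output == expected,
)
```

Here the toy agent passes on the original input but fails once the input is uppercased, so `fragile_cases(results)` reports that test case as fragile.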

This gives you evidence about two things:

whether the agent is correct on the original task
whether the agent remains correct when the same task is phrased differently

What Gold Team evaluations are testing

Gold Team evaluations test robustness to natural user variation, not misuse.

They help you answer questions such as:

Does the agent still answer correctly when the wording changes?
Does it remain correct when the user makes small spelling mistakes?
Does it stay accurate when the request includes extra or distracting context?
Does it preserve the expected behavior when the request is restated?

You do not need a malicious attack to expose weakness. Many failures appear when a legitimate user asks the same question in a slightly different form.

Examples of perturbations

Gold Team perturbations are designed to preserve the user goal while changing the surface form of the request.

For example, you might test:

Paraphrasing, where the same request is rewritten in different words
Typo or formatting noise, where the request contains spelling mistakes, spacing issues, or small input errors
Extra context or wording changes, where the request includes irrelevant detail or a different sentence structure

These examples are enough to show the main purpose: you are testing whether the agent is robust to natural user queries, not whether it survives a hostile jailbreak attempt.
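
As a rough sketch of what such perturbations could look like, the functions below implement typo noise, spacing noise, and extra context in plain Python. The function names and the default context sentence are assumptions made for illustration and are not Spec27's built-in perturbations. Paraphrasing is left out because it usually needs a model or a hand-written rewrite.

```python
import random


def add_typos(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters at roughly the given rate (seeded for repeatability)."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


def add_spacing_noise(text: str) -> str:
    """Double every space and pad the ends, leaving the words untouched."""
    return "  " + text.replace(" ", "  ") + " "


def add_extra_context(
    text: str,
    context: str = "I am writing this from my phone on the train.",
) -> str:
    """Prefix the request with detail that is irrelevant to the user goal."""
    return f"{context} {text}"


question = "Can I change my flight date?"
variants = {
    "typos": add_typos(question, rate=0.3),
    "spacing": add_spacing_noise(question),
    "extra_context": add_extra_context(question),
}
```

Each function changes only the surface form: the same characters or words are still present, so the legitimate user intent is preserved.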

How Gold Team scoring works

Gold Team specs can use all three evaluation methods in Spec27.

Strict equality

Use this when there is one exact correct answer.

Good fit for:

exact extraction
deterministic formatting
single canonical outputs

Permitted values

Use this when more than one answer is acceptable, but the answer space is still controlled.

Good fit for:

classification labels
finite action names
approved variants of a response

Judge-based scoring

Use this when correctness depends on interpretation.

Good fit for:

rubric-based response quality
partial-credit reasoning tasks
nuanced policy-following checks

Gold Team work often starts with strict equality or permitted values, then moves to judge-based scoring when the task becomes more open-ended.
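
To make the difference between the three methods concrete, here is a minimal sketch of each as a Python function. The names, the `threshold` parameter, and the keyword-counting judge are assumptions for illustration. In practice a judge is typically a model call, and Spec27's own configuration may look quite different.

```python
from typing import Callable, Iterable


def strict_equality(output: str, expected: str) -> bool:
    """Pass only when the output matches the single correct answer exactly."""
    return output == expected


def permitted_values(output: str, allowed: Iterable[str]) -> bool:
    """Pass when the output is one of a finite set of acceptable answers."""
    return output in set(allowed)


def judge_based(
    output: str,
    rubric: str,
    judge: Callable[[str, str], float],
    threshold: float = 0.7,
) -> bool:
    """Pass when a judge scores the output at or above the threshold."""
    return judge(output, rubric) >= threshold


def keyword_judge(output: str, rubric: str) -> float:
    """Stand-in judge: fraction of rubric keywords that appear in the output."""
    keywords = rubric.lower().split()
    if not keywords:
        return 0.0
    hits = sum(1 for word in keywords if word in output.lower())
    return hits / len(keywords)


labels = ["approve", "deny", "needs_review"]
exact_ok = strict_equality("42", "42")
label_ok = permitted_values("needs_review", labels)
judged_ok = judge_based("Refunds are issued within 14 days.", "refund 14 days", keyword_judge)
```

Note that strict equality rejects even a trailing space, which is why it suits deterministic outputs and not free-form answers.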

Why this matters

Gold Team evaluations help you answer practical questions such as:

Is this agent ready for release?
Which prompts or categories are still brittle?
Did the new model or code change introduce regressions?
Is the agent only good on the original test cases, or also on realistic user variation?
Which behaviors should we fix before we scale usage?

This is especially valuable when teams are comparing agent versions, validating launch readiness, or building confidence that a workflow is robust enough for real customers.

Examples

Example: support agent robustness

Suppose you have a customer-support agent that classifies refund requests.

The Gold Team evaluation might:

use a set of test cases built from legitimate refund questions
score with permitted values such as approve, deny, or needs_review
apply perturbations such as paraphrasing, touchscreen typos, or other spelling mistakes

What this is testing:

whether the classifier still routes correctly when wording changes
whether small spelling mistakes break the decision
whether mobile-style writing causes regressions
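
A toy version of this evaluation might look like the sketch below. Everything in it is hypothetical: the keyword-based classifier stands in for a real agent, the paraphrases are hand-written, and each test case carries its own set of permitted labels.

```python
from typing import Callable, List, Set, Tuple


def toy_refund_classifier(message: str) -> str:
    """Deliberately brittle stand-in for the agent under test."""
    text = message.lower()
    if "damaged" in text:
        return "approve"
    if "changed my mind" in text:
        return "deny"
    return "needs_review"


# Each case pairs a legitimate request with the labels that count as correct.
CASES: List[Tuple[str, Set[str]]] = [
    ("The item arrived damaged, I want a refund.", {"approve"}),
    ("I changed my mind, can I get my money back?", {"deny", "needs_review"}),
]

# Hand-written paraphrases that keep the user goal but change the wording.
PARAPHRASES = {
    CASES[0][0]: "My order showed up broken and I would like my money back.",
    CASES[1][0]: "I no longer want this, please refund me.",
}


def identity(message: str) -> str:
    return message


def touchscreen_typo(message: str) -> str:
    """Replace the first 'a' with its keyboard neighbour 's'."""
    return message.replace("a", "s", 1)


def paraphrase(message: str) -> str:
    return PARAPHRASES[message]


def pass_rate(perturb: Callable[[str], str]) -> float:
    """Share of cases whose label is still permitted after perturbation."""
    passed = sum(
        1 for message, allowed in CASES
        if toy_refund_classifier(perturb(message)) in allowed
    )
    return passed / len(CASES)


rates = {fn.__name__: pass_rate(fn) for fn in (identity, touchscreen_typo, paraphrase)}
```

With this toy classifier the baseline and the typo variant both pass every case, while the paraphrase of the damaged-item request is routed to `needs_review` instead of `approve`. That is exactly the kind of brittleness a Gold Team evaluation is meant to surface.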

Example: FAQ assistant correctness

Suppose you have an FAQ agent that should answer policy questions consistently.

The Gold Team evaluation might:

use exact expected answers for stable policy facts
test baseline prompts first
add perturbations such as sentence rewrites or extra context

What this is testing:

whether the agent knows the core answer
whether it keeps the same answer under paraphrase
whether irrelevant context causes a wrong answer

Example: travel workflow stability

Suppose you have an assistant that helps users understand travel information.

The Gold Team evaluation might:

use a set of test cases covering itinerary or travel-rules questions
score with a judge when exact wording is not important
apply perturbations such as non-ideal phrasing, different writing styles, or minor formatting noise

What this is testing:

whether the assistant still understands non-ideal phrasing
whether a professional or non-native writing style changes the outcome
whether formatting noise causes the workflow to degrade

What to do with the results

Gold Team results are most useful when you turn them into action.

Common next steps are:

fix prompt or agent logic for failed categories
tighten response formatting when exact outputs matter
add better instructions, tools, or retrieval grounding
split one broad eval into narrower specs so failures are easier to diagnose
keep the same eval and rerun it after changes to confirm improvement
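
One way to turn results into action is to group them by category and compare pass rates between runs. The sketch below assumes results are available as simple dictionaries with `category` and `passed` keys. That shape, the function names, and the sample data are illustrative assumptions, not a Spec27 export format.

```python
from collections import defaultdict
from typing import Dict, List


def pass_rate_by_category(results: List[dict]) -> Dict[str, float]:
    """Compute the fraction of passing results in each category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for result in results:
        bucket = counts[result["category"]]
        bucket[0] += int(result["passed"])
        bucket[1] += 1
    return {category: passed / total for category, (passed, total) in counts.items()}


def regressions(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """Categories present in both runs whose pass rate dropped."""
    return sorted(c for c in before if c in after and after[c] < before[c])


# Made-up results from two runs of the same eval, before and after a change.
before_run = [
    {"category": "refunds", "passed": True},
    {"category": "refunds", "passed": False},
    {"category": "shipping", "passed": True},
]
after_run = [
    {"category": "refunds", "passed": True},
    {"category": "refunds", "passed": True},
    {"category": "shipping", "passed": False},
]

before_rates = pass_rate_by_category(before_run)
after_rates = pass_rate_by_category(after_run)
regressed = regressions(before_rates, after_rates)
```

Rerunning the same eval keeps the comparison fair: in this made-up data the change fixed the refunds category but regressed shipping.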

Evaluating over a conversation

When the behaviour you care about only emerges across several turns, rather than from a single prompt and response, use goal-based multi-turn evaluation. This is the Gold Team flow in which a simulated user drives the conversation toward a defined goal.
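
The general shape of a goal-based multi-turn loop can be sketched as follows. This is a simplified illustration under assumed names (`run_goal_conversation`, `scripted_user`, `goal_reached`). A real simulated user would usually be model-driven, not scripted, and Spec27's actual flow may differ.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, message)


def run_goal_conversation(
    simulated_user: Callable[[List[Turn]], str],
    agent: Callable[[List[Turn]], str],
    goal_reached: Callable[[List[Turn]], bool],
    max_turns: int = 5,
) -> Tuple[bool, List[Turn]]:
    """Alternate user and agent turns until the goal is met or turns run out."""
    history: List[Turn] = []
    for _ in range(max_turns):
        history.append(("user", simulated_user(history)))
        history.append(("agent", agent(history)))
        if goal_reached(history):
            return True, history
    return False, history


def scripted_user(history: List[Turn]) -> str:
    """Hypothetical simulated user that follows a fixed two-step script."""
    script = ["I want to return my order.", "The order number is 1234."]
    turns_taken = sum(1 for speaker, _ in history if speaker == "user")
    return script[min(turns_taken, len(script) - 1)]


def toy_agent(history: List[Turn]) -> str:
    """Hypothetical agent that needs an order number before acting."""
    last_user_message = history[-1][1]
    if "1234" in last_user_message:
        return "Return started for order 1234."
    return "What is your order number?"


def return_started(history: List[Turn]) -> bool:
    return "Return started" in history[-1][1]


success, transcript = run_goal_conversation(scripted_user, toy_agent, return_started)
```

The goal check runs after every agent turn, so the evaluation measures whether the conversation as a whole reaches the goal, not whether any single response matches an expected output.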
