Gold Team Testing
Gold Team evaluations test whether an agent continues to behave correctly for normal, legitimate user requests.
In Spec27, Gold Team work focuses on desirable behavior, correctness, and robustness. The goal is not to break the system. The goal is to check whether the agent remains reliable when real users ask valid questions in different ways.
This matters because production traffic rarely looks like a clean demo prompt. Users paraphrase, make typing mistakes, and add extra context. A Gold Team evaluation measures whether the agent still reaches the expected outcome under those natural variations.
What a Gold Team evaluation does
A Gold Team evaluation usually combines:
- a primary dataset of normal tasks you care about
- an agent you want to evaluate
- an expected output for each dataset example
- an evaluation method that decides whether the output passed
- optional perturbations that make the input harder without changing the legitimate user intent
At run time, Spec27 typically does the following:
- takes a user input from the dataset
- optionally applies one or more perturbations to that input
- sends the original or perturbed input to the agent
- compares the output with the expected output
- records which examples still passed and which turned out to be fragile, meaning they failed once the input was perturbed
This gives you evidence about two things:
- whether the agent is correct on the original task
- whether the agent remains correct when the same task is phrased differently
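The run-time steps above can be sketched in a few lines of Python. This is a minimal illustration, not Spec27's API: the names Example, Result, run_gold_eval, and fragile are hypothetical, the agent is modelled as a plain function from string to string, and scoring uses simple equality. The sketch also assumes each example is sent both in its original and in its perturbed form, so that the two kinds of evidence can be separated.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    user_input: str
    expected: str


@dataclass
class Result:
    user_input: str
    passed_original: bool
    passed_perturbed: Optional[bool]  # None when no perturbation was applied


def run_gold_eval(
    examples: List[Example],
    agent: Callable[[str], str],
    perturb: Optional[Callable[[str], str]] = None,
) -> List[Result]:
    """Score each example on its original input and, optionally, a perturbed one."""
    results = []
    for ex in examples:
        passed_original = agent(ex.user_input) == ex.expected
        passed_perturbed = None
        if perturb is not None:
            passed_perturbed = agent(perturb(ex.user_input)) == ex.expected
        results.append(Result(ex.user_input, passed_original, passed_perturbed))
    return results


def fragile(results: List[Result]) -> List[Result]:
    """Examples that pass on the original input but fail once perturbed."""
    return [r for r in results if r.passed_original and r.passed_perturbed is False]


# Toy stand-in for a real agent: keys on the exact word "refund".
def toy_agent(text: str) -> str:
    return "approve" if "refund" in text else "needs_review"


examples = [Example("I want a refund for my order", "approve")]
results = run_gold_eval(
    examples, toy_agent, perturb=lambda s: s.replace("refund", "refnud")
)
print(fragile(results))  # the single example is correct originally, wrong with a typo
```

The toy agent is correct on the clean input and wrong on the misspelled one, which is exactly the kind of example the second question is meant to surface.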
What Gold Team evaluations are testing
Gold Team evaluations test robustness to natural user variation, not misuse.
They help you answer questions such as:
- Does the agent still answer correctly when the wording changes?
- Does it remain correct when the user makes small spelling mistakes?
- Does it stay accurate when the request includes extra or distracting context?
- Does it preserve the expected behavior when the request is restated?
You do not need a malicious attack to expose weakness. Many failures appear when a legitimate user asks the same question in a slightly different form.
Examples of perturbations
Gold Team perturbations are designed to preserve the user goal while changing the surface form of the request.
For example, you might test:
- Paraphrasing, where the same request is rewritten in different words
- Typo or formatting noise, where the request contains spelling mistakes, spacing issues, or small input errors
- Extra context or wording changes, where the request includes irrelevant detail or a different sentence structure
These examples are enough to show the main purpose: you are testing whether the agent is robust to natural user queries, not whether it survives a hostile jailbreak attempt.
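To make the idea of "same goal, different surface form" concrete, here is a small sketch of two of the perturbation types as plain Python functions. The function names and the specific transformations are illustrative assumptions, not Spec27's built-in perturbations. Paraphrasing is left out because it usually needs a model or hand-written rewrites.

```python
import random


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent inner characters in one randomly chosen word."""
    words = text.split(" ")
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text  # nothing long enough to perturb safely
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # keep first and last characters in place
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def add_formatting_noise(text: str) -> str:
    """Add stray whitespace without changing any word."""
    return "  " + text.replace(" ", "  ", 1) + " \n"


def add_extra_context(text: str) -> str:
    """Prepend irrelevant detail; the actual request is unchanged."""
    return "Sorry to bother you, I'm writing from my phone. " + text


rng = random.Random(0)  # fixed seed so a rerun produces the same perturbed inputs
request = "Can I get a refund for my order"
for variant in (add_typo(request, rng), add_formatting_noise(request), add_extra_context(request)):
    print(repr(variant))
```

Seeding the random generator is a deliberate choice in this sketch: it keeps the perturbed inputs stable across runs, so a change in results reflects a change in the agent rather than a change in the test.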
How Gold Team scoring works
Gold Team specs can use any of the three evaluation methods in Spec27: strict equality, permitted values, and judge-based scoring.
Strict equality
Use this when there is one exact correct answer.
Good fit for:
- exact extraction
- deterministic formatting
- single canonical outputs
Permitted values
Use this when more than one answer is acceptable, but the answer space is still controlled.
Good fit for:
- classification labels
- finite action names
- approved variants of a response
Judge-based scoring
Use this when correctness depends on interpretation.
Good fit for:
- rubric-based response quality
- partial-credit reasoning tasks
- nuanced policy-following checks
Gold Team work often starts with strict equality or permitted values, then moves to judge-based scoring when the task becomes more open-ended.
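The three methods differ mainly in what counts as a pass. The sketch below expresses each one as a small Python function to show the contrast. The function names, the 0.7 threshold, and the keyword-based stand-in judge are all assumptions made for illustration. A real judge would typically be a model or a human rubric, and Spec27's own configuration may look different.

```python
from typing import Callable, Iterable


def strict_equality(output: str, expected: str) -> bool:
    """Pass only when the output matches the single expected answer exactly."""
    return output == expected


def permitted_values(output: str, allowed: Iterable[str]) -> bool:
    """Pass when the output is one of a controlled set of acceptable answers."""
    return output in set(allowed)


def judge_based(
    output: str,
    rubric: str,
    judge: Callable[[str, str], float],
    threshold: float = 0.7,  # assumed cut-off, chosen only for this example
) -> bool:
    """Pass when the judge's score for the output meets the threshold."""
    return judge(output, rubric) >= threshold


def keyword_judge(output: str, rubric: str) -> float:
    """Stand-in judge: fraction of comma-separated rubric keywords found in the output."""
    keywords = [k.strip().lower() for k in rubric.split(",") if k.strip()]
    if not keywords:
        return 0.0
    hits = sum(1 for k in keywords if k in output.lower())
    return hits / len(keywords)


print(strict_equality("14 days.", "14 days"))                      # trailing period fails
print(permitted_values("deny", ["approve", "deny", "needs_review"]))
print(judge_based("Carry-on bags must be under 10 kg.", "carry-on, 10 kg", keyword_judge))
```

Note how strict equality rejects an answer over a trailing period. That is the behavior you want for exact extraction, and a reason to prefer permitted values or a judge once acceptable answers can vary.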
Why this matters
Gold Team evaluations help you answer practical questions such as:
- Is this agent ready for release?
- Which prompts or categories are still brittle?
- Did the new model or code change introduce regressions?
- Is the agent only good on the clean dataset, or also on realistic user variation?
- Which behaviors should we fix before we scale usage?
This is especially valuable when teams are comparing agent versions, validating launch readiness, or building confidence that a workflow is robust enough for real customers.
Examples
Example: support agent robustness
Suppose you have a customer-support agent that classifies refund requests.
The Gold Team evaluation might:
- use a dataset of legitimate refund questions
- score with permitted values such as approve, deny, or needs_review
- apply perturbations such as paraphrasing, texting-style phrasing, or typos
What this is testing:
- whether the classifier still routes correctly when wording changes
- whether small spelling mistakes break the decision
- whether mobile-style writing causes regressions
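One way to read the results of such an evaluation is to group pass rates by perturbation type. The sketch below does this over a handful of made-up run records. The record layout, the perturbation names, and the numbers are hypothetical and exist only to show the calculation; they are not output from Spec27.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

PERMITTED = {"approve", "deny", "needs_review"}

# Hypothetical records: (perturbation name, agent output, labels accepted for that example)
records: List[Tuple[str, str, Set[str]]] = [
    ("none", "approve", {"approve"}),
    ("none", "deny", {"deny", "needs_review"}),
    ("typo", "approve", {"approve"}),
    ("typo", "approve", {"deny", "needs_review"}),
    ("texting_style", "needs_review", {"approve"}),
    ("texting_style", "deny", {"deny", "needs_review"}),
]


def pass_rate_by_perturbation(
    records: List[Tuple[str, str, Set[str]]]
) -> Dict[str, float]:
    """Fraction of examples that passed, grouped by the perturbation applied."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for name, output, accepted in records:
        totals[name] += 1
        # The output must be a known label and one of the labels accepted for this example.
        if output in PERMITTED and output in accepted:
            passes[name] += 1
    return {name: passes[name] / totals[name] for name in totals}


for name, rate in pass_rate_by_perturbation(records).items():
    print(f"{name}: {rate:.0%}")
```

In this invented data the classifier is fully correct on unperturbed inputs and drops to half under typos and texting-style phrasing, which would point to robustness problems rather than a gap in the underlying routing logic.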
Example: FAQ assistant correctness
Suppose you have an FAQ agent that should answer policy questions consistently.
The Gold Team evaluation might:
- use exact expected answers for stable policy facts
- test baseline prompts first
- add perturbations such as sentence rewrites or extra context
What this is testing:
- whether the agent knows the core answer
- whether it keeps the same answer under paraphrase
- whether irrelevant context causes a wrong answer
Example: travel workflow stability
Suppose you have an assistant that helps users understand travel information.
The Gold Team evaluation might:
- use a dataset of itinerary or rules questions
- score with a judge when exact wording is not important
- apply perturbations such as non-ideal phrasing, different writing styles, or minor formatting noise
What this is testing:
- whether the assistant still understands non-ideal phrasing
- whether a professional tone or non-native-speaker phrasing changes the outcome
- whether formatting noise causes the workflow to degrade
What to do with the results
Gold Team results are most useful when you turn them into action.
Common next steps are:
- fix prompt or agent logic for failed categories
- tighten response formatting when exact outputs matter
- add better instructions, tools, or retrieval grounding
- split one broad eval into narrower specs so failures are easier to diagnose
- keep the same eval and rerun it after changes to confirm improvement
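Rerunning the same evaluation is most informative when you compare the two runs example by example rather than only looking at the overall pass rate. The following sketch assumes each run can be reduced to a mapping from an example ID to pass or fail. The function name, the ID format, and the sample data are hypothetical.

```python
from typing import Dict, List


def compare_runs(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, List[str]]:
    """List examples that changed status between two runs of the same eval."""
    shared = before.keys() & after.keys()  # only compare examples present in both runs
    fixed = sorted(k for k in shared if not before[k] and after[k])
    regressed = sorted(k for k in shared if before[k] and not after[k])
    return {"fixed": fixed, "regressed": regressed}


# Hypothetical results keyed by "<example id>/<perturbation>"
before = {"ex1/none": True, "ex1/typo": False, "ex2/none": True, "ex2/typo": True}
after = {"ex1/none": True, "ex1/typo": True, "ex2/none": True, "ex2/typo": False}

print(compare_runs(before, after))
```

Here the aggregate pass rate is the same before and after, yet one example was fixed and another regressed. A per-example comparison makes that visible.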