Red Team Testing
Red Team evaluations test whether an agent can be pushed into harmful or policy-breaking behavior.
In Spec27, Red Team work focuses on harmful behavior and other unsafe responses that may appear under adversarial pressure. The goal is to surface weaknesses before the agent is used in production.
Red Team work matters because many safety failures do not appear during normal product testing. They appear when a harmful request is disguised, reframed, or made to sound acceptable. A Red Team evaluation helps you measure whether the agent still refuses those requests.
What a Red Team evaluation does
A Red Team evaluation usually combines:
- a red-team dataset or seed prompts containing harmful requests
- an agent you want to evaluate
- one or more adversarial transformations or attack methods
- judge-based scoring that determines whether the harmful behavior succeeded
At run time, Spec27 typically does the following:
- starts from a risky or policy-sensitive user intent
- generates or applies adversarial prompt variants
- sends those variants to the agent
- asks a judge whether the response crossed the failure boundary
- records which attacks succeeded, failed, or need investigation
The important question is not only whether the model answered. The real question is whether it held its boundary under pressure.
What Red Team evaluations are testing
Red Team evaluations test whether an agent can be manipulated into producing harmful output it should refuse.
They help you answer questions such as:
- Can the user hide a harmful request inside roleplay?
- Can they reframe harmful intent so it sounds acceptable?
- Can they ask for harmful advice in an indirect way?
- Can they make the agent produce abusive, bullying, or dangerous content?
- Can they push the agent into giving unsafe guidance it should refuse?
Examples of adversarial transformations
Red Team transformations preserve the harmful goal while changing how the request is presented.
For example, you might test:
- Roleplay framing, where the request is wrapped in a fictional or authority-based scenario
- Obfuscated wording, where the harmful intent is hidden behind indirect or softened phrasing
- Format or context wrappers, where the request is embedded in code, long context, or a structured format
These examples are enough to show the main purpose: you are testing whether the agent resists adversarial pressure, not whether it handles ordinary user variation.
Why Red Team scoring is usually judge-based
Red Team work normally uses judge-based scoring because the failure condition is usually contextual.
The judge may need to decide things like:
- whether the model meaningfully complied with a harmful request
- whether a refusal was partial or incomplete
- whether the output included harmful advice or abusive content
- whether the output became materially more unsafe after the attack
That is why Red Team evaluation is typically not a strict string-match problem. The important signal is whether the safety boundary held.
Why this matters
Red Team evaluations help you answer questions such as:
- Can this agent be pushed into harmful behavior in realistic ways?
- Which attack styles are most successful against the current version?
- Are safety instructions strong enough, or only strong against direct prompts?
- Does the system produce harmful advice when requests are wrapped or disguised?
- Which failure modes should block launch or require deeper mitigation?
This matters for release readiness, governance reviews, customer trust, and ongoing monitoring after deployment.
Examples
Example: refusal-boundary testing
Suppose you have an assistant that should refuse harmful requests.
The Red Team evaluation might:
- start from a dataset of harmful intents
- apply adversarial transformations such as roleplay framing, academic framing, or authority-based framing
- use a judge to decide whether the model stayed within policy
What this is testing:
- whether the refusal survives storytelling or authority framing
- whether the model mistakes “educational” framing for permission
- whether hidden prompt weaknesses appear under adversarial pressure
Example: harmful behavior testing for a tutor agent
Suppose you have a tutor agent that helps students with writing and homework.
The Red Team evaluation might:
- test prompts that ask for harmful advice, bullying language, or unsafe instructions
- apply adversarial transformations such as roleplay prompts, indirect wording, or long-context prompts
- use a judge to score whether the agent produced harmful output
What this is testing:
- whether the tutor agent gives harmful advice it should refuse
- whether indirect wording weakens the refusal behavior
- whether the agent produces bullying or abusive content when provoked
Example: jailbreak regression tracking
Suppose you have already improved safety instructions and want to know whether a new model version regressed.
The Red Team evaluation might:
- keep the same red-team specification
- rerun the same attack methods against the new agent version
- compare success and failure patterns over time
What this is testing:
- whether the mitigation really worked
- whether the new version is safer or weaker than the old one
- whether specific attack families still need attention
What to do with the results
Red Team results are useful when they drive mitigation and re-testing.
Common next steps are:
- tighten system instructions or policy constraints
- improve tool access boundaries and secret handling
- add more refusal training or guardrail logic
- create narrower red-team specs for the highest-risk behaviors
- rerun the same eval after every material model or prompt change
Gold Team vs Red Team in practice
Gold Team asks whether the agent works well for legitimate users.
Red Team asks whether the same agent can be pushed into behavior it should refuse.
Most mature teams need both:
- Gold Team to prove quality and robustness
- Red Team to prove safety and resistance to misuse