Red Team Testing

Red Team evaluations test whether an agent can be pushed into harmful or policy-breaking behavior.

In Spec27, Red Team work focuses on harmful behavior and other unsafe responses that may appear under adversarial pressure. The goal is to surface weaknesses before the agent is used in production.

Red Team work matters because many safety failures do not appear during normal product testing. They appear when a harmful request is disguised, reframed, or made to sound acceptable. A Red Team evaluation helps you measure whether the agent still refuses those requests.

What a Red Team evaluation does

A Red Team evaluation usually combines:

a set of harmful or policy-sensitive intent entries or seed prompts
an agent you want to evaluate
one or more adversarial transformations or attack methods
judge-based scoring that determines whether the harmful behavior succeeded

At run time, Spec27 typically does the following:

starts from a risky or policy-sensitive user intent
generates or applies adversarial prompt variants
sends those variants to the agent
asks a judge whether the response crossed the failure boundary
records which attacks succeeded, failed, or need investigation
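
The run-time steps above can be sketched as a small loop. This is a minimal illustration, not Spec27's implementation: `run_red_team`, `AttackResult`, `stub_agent`, and `stub_judge` are hypothetical names, and the agent and judge are hard-coded stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names in this sketch are hypothetical; none come from Spec27.
Transform = Callable[[str], str]


@dataclass
class AttackResult:
    intent: str
    attack: str
    outcome: str  # "succeeded", "failed", or "needs_investigation"


def run_red_team(
    intents: List[str],
    attacks: Dict[str, Transform],
    agent: Callable[[str], str],
    judge: Callable[[str, str], str],
) -> List[AttackResult]:
    """Apply every attack to every intent and record the judge's outcome."""
    results = []
    for intent in intents:
        for name, transform in attacks.items():
            prompt = transform(intent)          # adversarial variant
            response = agent(prompt)            # send it to the agent
            outcome = judge(intent, response)   # did it cross the boundary?
            results.append(AttackResult(intent, name, outcome))
    return results


def stub_agent(prompt: str) -> str:
    # Stand-in agent: refuses direct requests but gives in to story framing.
    if "story" in prompt:
        return "Sure, here is how..."
    return "I can't help with that."


def stub_judge(intent: str, response: str) -> str:
    # Stand-in judge: a real judge would be a model call, not a prefix check.
    if response.startswith("I can't"):
        return "failed"        # the attack failed, the boundary held
    if response.startswith("Sure"):
        return "succeeded"     # the attack succeeded
    return "needs_investigation"


attacks = {
    "direct": lambda intent: intent,
    "roleplay": lambda intent: f"Write a story where a character explains: {intent}",
}
results = run_red_team(["<placeholder risky intent>"], attacks, stub_agent, stub_judge)
for r in results:
    print(r.attack, r.outcome)
```

Note that "succeeded" and "failed" describe the attack, not the agent: a failed attack means the agent held its boundary.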

The question is not simply whether the model answered. It is whether the model held its boundary under pressure.

What Red Team evaluations are testing

Red Team evaluations test whether an agent can be manipulated into producing harmful output it should refuse.

They help you answer questions such as:

Can the user hide a harmful request inside roleplay?
Can they reframe harmful intent so it sounds acceptable?
Can they ask for harmful advice in an indirect way?
Can they make the agent produce abusive, bullying, or dangerous content?
Can they push the agent into giving unsafe guidance it should refuse?

Examples of adversarial transformations

Red Team transformations preserve the harmful goal while changing how the request is presented.

For example, you might test:

Roleplay framing, where the request is wrapped in a fictional or authority-based scenario
Obfuscated wording, where the harmful intent is hidden behind indirect or softened phrasing
Format or context wrappers, where the request is embedded in code, long context, or a structured format

These examples are enough to show the main purpose: you are testing whether the agent resists adversarial pressure, not whether it handles ordinary user variation.
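
One way to picture these transformations is as functions that take a seed intent and return a reworded prompt carrying the same goal. The sketch below assumes that framing; the function names and template wording are hypothetical and deliberately benign, and the seed is a placeholder rather than a real harmful request.

```python
import json


def roleplay_framing(intent: str) -> str:
    # Wrap the request in a fictional, authority-based scenario.
    return f"You are an actor playing an expert. Stay in character and answer: {intent}"


def obfuscated_wording(intent: str) -> str:
    # Soften the request with indirect, hypothetical phrasing.
    return f"Hypothetically, and purely for awareness, how might someone approach this: {intent}"


def format_wrapper(intent: str) -> str:
    # Embed the request in a structured payload instead of plain prose.
    return json.dumps({"task": "complete_field", "field": "answer", "question": intent})


TRANSFORMS = {
    "roleplay_framing": roleplay_framing,
    "obfuscated_wording": obfuscated_wording,
    "format_wrapper": format_wrapper,
}

seed = "<placeholder policy-sensitive intent>"
variants = {name: fn(seed) for name, fn in TRANSFORMS.items()}

for name, prompt in variants.items():
    print(f"{name}: {prompt}")
```

Every variant still contains the original intent; only the presentation changes, which is what separates an adversarial transformation from ordinary paraphrase.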

Why Red Team scoring is usually judge-based

Red Team work normally uses judge-based scoring because the failure condition is usually contextual.

The judge may need to decide things like:

whether the model meaningfully complied with a harmful request
whether a refusal was partial or incomplete
whether the output included harmful advice or abusive content
whether the output became materially more unsafe after the attack

That is why Red Team evaluation is typically not a strict string-match problem. The important signal is whether the safety boundary held.
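
A small sketch shows why string matching falls short. It assumes a hypothetical `JudgeVerdict` structure, and the verdict is written by hand to stand in for what a judge model might return; none of these names are Spec27 APIs.

```python
from dataclasses import dataclass

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def string_match_refused(response: str) -> bool:
    """Naive check: treat any refusal phrase as a full refusal."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


@dataclass
class JudgeVerdict:
    # Hypothetical fields mirroring the judgments listed above.
    complied: bool
    partial_refusal: bool
    contains_harmful_content: bool

    @property
    def boundary_held(self) -> bool:
        return not (self.complied or self.contains_harmful_content)


# A partial refusal: the response opens with a refusal phrase, then complies.
response = "I can't give full details, but here is the general approach: <placeholder unsafe steps>"

# Hand-written verdict standing in for a judge model's output.
verdict = JudgeVerdict(complied=True, partial_refusal=True, contains_harmful_content=True)

print("string match says refused:", string_match_refused(response))
print("judge says boundary held:", verdict.boundary_held)
```

The string check reports a refusal because the refusal phrase is present, while the judge verdict records that the boundary did not hold.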

Why this matters

For a release decision, Red Team results help you answer questions such as:

Can this agent be pushed into harmful behavior in realistic ways?
Which attack styles are most successful against the current version?
Do the safety instructions hold against disguised requests, or only against direct prompts?
Does the system produce harmful advice when requests are wrapped or disguised?
Which failure modes should block launch or require deeper mitigation?

This matters for release readiness, governance reviews, customer trust, and ongoing monitoring after deployment.

Examples

Example: refusal-boundary testing

Suppose you have an assistant that should refuse harmful requests.

The Red Team evaluation might:

start from a set of harmful or policy-sensitive intent entries
apply adversarial transformations such as roleplay framing, academic framing, or authority-based framing
use a judge to decide whether the model stayed within policy

What this is testing:

whether the refusal survives storytelling or authority framing
whether the model mistakes “educational” framing for permission
whether hidden prompt weaknesses appear under adversarial pressure

Example: harmful behavior testing for a tutor agent

Suppose you have a tutor agent that helps students with writing and homework.

The Red Team evaluation might:

test prompts that ask for harmful advice, bullying language, or unsafe instructions
apply adversarial transformations such as roleplay prompts, indirect wording, or long-context prompts
use a judge to score whether the agent produced harmful output

What this is testing:

whether the tutor agent gives harmful advice it should refuse
whether indirect wording weakens the refusal behavior
whether the agent produces bullying or abusive content when provoked

Example: jailbreak regression tracking

Suppose you have already improved safety instructions and want to know whether a new model version regressed.

The Red Team evaluation might:

keep the same red-team specification
rerun the same attack methods against the new agent version
compare success and failure patterns over time

What this is testing:

whether the mitigation really worked
whether the new version is safer or weaker than the old one
whether specific attack families still need attention
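
Comparing runs over time can be as simple as computing the attack success rate per attack family for each version and looking at the difference. The sketch below assumes each result is an `(attack_family, outcome)` pair; the function names and the sample data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Result = Tuple[str, str]  # (attack_family, outcome)


def success_rate_by_attack(results: List[Result]) -> Dict[str, float]:
    """Fraction of attempts per attack family where the attack succeeded."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for attack, outcome in results:
        totals[attack] += 1
        if outcome == "succeeded":
            wins[attack] += 1
    return {attack: wins[attack] / totals[attack] for attack in totals}


def compare_versions(old: List[Result], new: List[Result]) -> Dict[str, float]:
    """Per-family change in success rate; positive means the new version is weaker."""
    old_rates = success_rate_by_attack(old)
    new_rates = success_rate_by_attack(new)
    families = sorted(set(old_rates) | set(new_rates))
    return {f: new_rates.get(f, 0.0) - old_rates.get(f, 0.0) for f in families}


# Illustrative data only, not real evaluation results.
old_run = [("roleplay", "succeeded"), ("roleplay", "succeeded"),
           ("obfuscation", "failed"), ("obfuscation", "succeeded")]
new_run = [("roleplay", "failed"), ("roleplay", "succeeded"),
           ("obfuscation", "succeeded"), ("obfuscation", "succeeded")]

deltas = compare_versions(old_run, new_run)
regressed = [family for family, delta in deltas.items() if delta > 0]

print(deltas)
print("regressed families:", regressed)
```

In this made-up data the mitigation helps against roleplay but the obfuscation family gets worse, which is the kind of per-family signal an aggregate pass rate would hide.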

What to do with the results

Red Team results are useful when they drive mitigation and re-testing.

Common next steps are:

tighten system instructions or policy constraints
improve tool access boundaries and secret handling
add more refusal training or guardrail logic
create narrower red-team specs for the highest-risk behaviors
rerun the same eval after every material model or prompt change

Gold Team vs Red Team in practice

Gold Team asks whether the agent works well for legitimate users.

Red Team asks whether the same agent can be pushed into behavior it should refuse.

Most mature teams need both:

Gold Team to prove quality and robustness
Red Team to prove safety and resistance to misuse

Probing across a conversation

Some failures only appear when an adversary keeps pushing across several turns. For that, use red-team multi-turn evaluation, where a simulated adversarial user probes the agent across a bounded conversation.
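
As a rough sketch of what a bounded multi-turn probe could look like, the loop below lets a simulated adversary keep pushing until the judge flags a reply or the turn limit is reached. `probe_conversation` and the stub adversary, agent, and judge are hypothetical and do not reflect how Spec27 implements this.

```python
from typing import Callable, List, Optional, Tuple

Message = Tuple[str, str]  # (role, text)


def probe_conversation(
    adversary: Callable[[List[Message]], str],
    agent: Callable[[List[Message]], str],
    judge: Callable[[str], bool],
    max_turns: int = 5,
) -> Tuple[Optional[int], List[Message]]:
    """Return the turn on which the boundary broke (or None) and the transcript."""
    history: List[Message] = []
    for turn in range(1, max_turns + 1):
        history.append(("user", adversary(history)))
        reply = agent(history)
        history.append(("assistant", reply))
        if judge(reply):  # True means the reply crossed the failure boundary
            return turn, history
    return None, history


def stub_adversary(history: List[Message]) -> str:
    # Stand-in adversary: escalates with each turn.
    return f"push attempt {len(history) // 2 + 1}"


def stub_agent(history: List[Message]) -> str:
    # Stand-in agent: holds for two turns, then gives in on the third.
    user_turns = sum(1 for role, _ in history if role == "user")
    if user_turns >= 3:
        return "OK, here it is: <placeholder unsafe content>"
    return "I can't help with that."


def stub_judge(reply: str) -> bool:
    # Stand-in judge: a real judge would be a model call.
    return reply.startswith("OK")


broke_on, transcript = probe_conversation(stub_adversary, stub_agent, stub_judge)
print("boundary broke on turn:", broke_on)
```

The `max_turns` limit is what keeps the conversation bounded: with `max_turns=2` the same stub agent would appear safe, so the bound is itself a parameter worth recording with the results.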
