What Spec27 Helps You Do

Spec27 helps you test AI agents with more rigor than one-off prompting, isolated notebook experiments, or manual spot checks.

Instead of asking a few prompts, scanning the outputs, and moving on, you can use Spec27 to define what should be tested, run the same evaluation again later, and review the evidence in one place. The product is built around reusable evaluation assets, explicit scoring, and saved results that stay connected to the setup that produced them.

That matters because testing an AI agent is not only about whether it answered one prompt correctly. You also need to know whether it stays correct across many inputs, remains robust when prompts vary, holds its safety boundaries under pressure, and improves over time instead of regressing silently.

What Spec27 helps teams do

You can use Spec27 to:

organize evaluation work inside shared Projects
author Specifications that define test cases, expected outputs, scoring rules, and evaluation recipes
add project Secrets when runtime access depends on protected values
configure robustness through attack methods and adversarial coverage inside the spec editor
save Evals as named, repeatable setups that connect agents and specifications
inspect Results, logs, and run history over time

The key idea is repeatability. You can author a specification once with test cases and scoring rules, run the same eval again later, and check whether an updated agent actually got better.

How the product works in practice

Spec27 organizes testing as a chain of reusable assets rather than a loose collection of prompts.

Organization → Project → Specification → Eval → Run → Results

Each part of that chain has a clear role:

the Organization is the shared workspace for access, roles, and usage
the Project groups one evaluation stream and its assets together
a Specification is where you author test entries, define expected outputs, configure scoring rules, and set up attack methods for robustness testing
an Eval saves a runnable setup that connects an agent, project secrets, and a specification
a Run is one execution of that setup
Results preserve the outputs, scores, errors, and logs for later review

This structure improves testing quality because it keeps the inputs, scoring logic, execution setup, and outcomes tied together. Instead of losing context after a quick experiment, you can see exactly what was tested, how it was scored, and what changed between runs.
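One way to picture the chain is as a set of nested records. The sketch below is only an illustration in Python, not Spec27's actual data model or API: every class name, field name, and the strict-equality check inside `results` is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    """One test entry: an input and the output expected for it (hypothetical shape)."""
    prompt: str
    expected: str


@dataclass
class Specification:
    """Reusable recipe: test entries plus the robustness methods to apply."""
    name: str
    entries: List[Entry]
    attack_methods: List[str] = field(default_factory=list)


@dataclass
class EvalSetup:
    """Named, runnable setup connecting an agent, secrets, and a specification."""
    name: str
    agent: str  # hypothetical agent identifier
    specification: Specification
    secret_names: List[str] = field(default_factory=list)


@dataclass
class Run:
    """One execution of an EvalSetup, holding one output per test entry."""
    setup: EvalSetup
    outputs: List[str]

    def results(self) -> List[Dict[str, object]]:
        """Pair each entry with its output so the evidence stays tied to the setup."""
        entries = self.setup.specification.entries
        return [
            {
                "eval": self.setup.name,
                "prompt": entry.prompt,
                "expected": entry.expected,
                "output": output,
                "correct": output == entry.expected,  # strict equality, for simplicity
            }
            for entry, output in zip(entries, self.outputs)
        ]


spec = Specification(
    name="refund-policy",
    entries=[Entry("How many days do I have to return an item?", "30")],
    attack_methods=["paraphrase"],
)
setup = EvalSetup(name="support-agent-v1", agent="support-agent", specification=spec)
run = Run(setup=setup, outputs=["30"])
print(run.results()[0]["correct"])  # True
```

Because each result row carries the eval name, the prompt, the expected output, and the actual output, a later reviewer can see what was tested and how it was judged without reconstructing the setup.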

How Spec27 improves testing for AI agents

Spec27 makes agent testing more systematic in five ways.

Move from single prompts to structured test entries

A few hand-picked prompts can tell you whether something looks promising, but they are weak evidence. Spec27 lets you author test entries inside a specification that represent the behaviors you actually care about, then reuse that specification across runs.

That makes testing broader and more honest. Instead of asking, "Did it answer this one prompt well?", you can ask, "How did it perform across the full set of behaviors we care about?"

Test robustness, not just happy-path correctness

Real users do not always type ideal prompts. They paraphrase, make typos, add distracting context, or ask things in unfamiliar formats. Spec27 supports robustness testing through reusable specifications, attack methods, and adversarial coverage.

That helps you learn whether an agent is only correct in the cleanest case or whether it still behaves correctly under realistic variation.
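To make the idea of a typo-style perturbation concrete, here is a minimal sketch in Python. It is not how Spec27 implements its attack methods. The function name `swap_adjacent_letters` and its parameters are hypothetical, and swapping neighboring letters is just one simple stand-in for realistic typing noise.

```python
import random


def swap_adjacent_letters(text: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Return a copy of `text` with up to `n_swaps` adjacent letter pairs swapped.

    A fixed seed keeps the perturbation reproducible, so the same noisy
    prompt can be replayed in a later run.
    """
    rng = random.Random(seed)
    chars = list(text)
    # Only swap where both neighbors are letters, so spacing and punctuation survive.
    positions = [
        i for i in range(len(chars) - 1)
        if chars[i].isalpha() and chars[i + 1].isalpha()
    ]
    for i in rng.sample(positions, min(n_swaps, len(positions))):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


clean_prompt = "What is your refund policy?"
noisy_prompt = swap_adjacent_letters(clean_prompt, n_swaps=2, seed=7)
print(noisy_prompt)
```

The perturbed prompt keeps the same characters as the clean one, only reordered, so the expected output for the test entry does not change. Any drop in correctness can then be attributed to the noise rather than to a different question.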

Reuse the same specification across agent versions

When you update prompts, switch models, or change agent logic, you need to know whether the change actually improved behavior. Spec27 makes that easier by separating the reusable evaluation recipe (the Specification) from the runnable setup (the Eval).

You can reuse the same specification with updated agents and compare outcomes across versions instead of rebuilding the test manually every time.

Use explicit scoring instead of vague impressions

Some tasks need exact matching. Some allow multiple correct outputs. Some require a judge to interpret whether the answer met the standard. Spec27 supports strict equality, permitted values, and judge-based scoring so the evaluation method fits the task.

That produces clearer evidence than "this looked good when I tried it."
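The three scoring styles differ mainly in what they compare the output against. The sketch below shows that difference in Python. It is an assumption-laden illustration rather than Spec27's scoring implementation: all function names are hypothetical, and `placeholder_refusal_judge` is a keyword check standing in for whatever judge a real evaluation would use.

```python
from typing import Callable, Iterable


def strict_equality(output: str, expected: str) -> bool:
    """Pass only when the output matches the expected value exactly."""
    return output == expected


def permitted_values(output: str, allowed: Iterable[str]) -> bool:
    """Pass when the output is any one of several acceptable answers."""
    return output in set(allowed)


def judge_based(output: str, judge: Callable[[str], bool]) -> bool:
    """Pass when a judge decides the output met the standard."""
    return judge(output)


def placeholder_refusal_judge(output: str) -> bool:
    """Hypothetical stand-in judge: treats a few refusal phrases as a pass.

    A real judge would interpret the answer, not match keywords.
    """
    markers = ("can't help", "cannot help", "won't assist")
    lowered = output.lower()
    return any(marker in lowered for marker in markers)


print(strict_equality("30", "30"))                                        # True
print(permitted_values("NYC", ["NYC", "New York"]))                       # True
print(judge_based("I can't help with that.", placeholder_refusal_judge))  # True
```

Choosing the rule per task matters: strict equality would fail a correct answer that was merely phrased differently, while a judge would be unnecessary overhead for a task with a single exact answer.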

Preserve evidence, failure context, and results over time

Spec27 keeps runs attached to their evals and specifications. Results can include outputs, correctness, judge details, logs, statuses, and exportable data.

That gives you something reviewable and repeatable. If a failure appears, you can investigate it. If you make a change, you can rerun the same setup and see whether the problem was fixed.

What kinds of evaluation work Spec27 supports

Gold Team evaluations

Gold Team work focuses on desirable behavior, correctness, and robustness. In these evaluations, you are usually asking whether the agent does the right thing for legitimate tasks.

That can include testing:

whether an agent returns the correct answer
whether it stays correct across phrasing changes
whether typos or formatting noise cause failures
whether the agent remains stable when inputs are paraphrased or made slightly harder

Gold Team work is useful when you want to improve product quality, reduce regressions, and build confidence that the agent works well for normal users.

Red Team evaluations

Red Team work focuses on misuse, harmfulness, jailbreaks, sensitive-information exposure, and other failure-seeking behaviors. In these evaluations, you are testing whether an agent can be pushed past the behavior boundary it is supposed to hold.

That can include testing:

whether the agent refuses unsafe requests
whether jailbreak-style prompting changes its behavior
whether roleplay or obfuscation causes unsafe compliance
whether the system leaks protected or sensitive information

Red Team work is useful when you want to understand where an agent is vulnerable, which attack styles are most effective, and what needs mitigation before release.

Concrete examples

Example: Gold Team support-agent testing

Suppose you have a support agent that answers policy questions. You author a specification with test entries for legitimate user questions, define the expected outputs, and run the same evaluation after each prompt or model update.

Then you add robustness methods such as paraphrasing, touchscreen typos, or other typo-based perturbations to the specification. This helps you see whether the agent is only good on clean examples or whether it still performs well when real users ask the same question in messy ways.

Example: Red Team refusal-boundary testing

Suppose you have an assistant that should refuse harmful or disallowed requests. You author a red-team specification with test entries, configure adversarial methods such as roleplay or jailbreak-style prompting inside the spec editor, and set up scoring with a judge-based rule.

This helps you see whether the refusal behavior is actually strong or whether it only works against direct prompts and breaks under adversarial framing.

Example: Iteration after an agent change

You update an agent's prompt, tools, or underlying model. Instead of redoing the evaluation informally, you rerun the same eval in Spec27 and compare the new results with the previous run.

This helps you answer a concrete question: did the agent improve, regress, or simply change behavior in a way that needs more investigation?
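The comparison between two runs can be sketched as a per-entry diff. The Python below is a minimal illustration, not Spec27's comparison feature. It assumes each run has been reduced to a mapping from a hypothetical test-entry ID to a pass/fail flag, and the function name `compare_runs` is invented for this example.

```python
from typing import Dict, List


def compare_runs(previous: Dict[str, bool], current: Dict[str, bool]) -> Dict[str, List[str]]:
    """Classify each test entry by how its pass/fail status changed between runs."""
    report: Dict[str, List[str]] = {
        "improved": [],
        "regressed": [],
        "unchanged": [],
        "not_comparable": [],
    }
    # Entries present in both runs can be compared directly.
    for entry_id in sorted(previous.keys() & current.keys()):
        before, after = previous[entry_id], current[entry_id]
        if after and not before:
            report["improved"].append(entry_id)
        elif before and not after:
            report["regressed"].append(entry_id)
        else:
            report["unchanged"].append(entry_id)
    # Entries present in only one run need a closer look rather than a verdict.
    report["not_comparable"] = sorted(previous.keys() ^ current.keys())
    return report


previous_run = {"q1": True, "q2": False, "q3": True}
current_run = {"q1": True, "q2": True, "q3": False, "q4": True}
print(compare_runs(previous_run, current_run))
```

A per-entry diff is more informative than comparing two aggregate scores: in the example above, one entry improved and one regressed, so the overall pass count for the shared entries is the same even though the agent's behavior changed.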
