What Spec27 Helps You Do
Spec27 helps you test AI agents with more rigor than one-off prompting, isolated notebook experiments, or manual spot checks.
Instead of trying a few prompts, scanning the outputs, and moving on, you can use Spec27 to define what should be tested, run the same evaluation again later, and review the evidence in one place. The product is built around reusable evaluation assets, explicit scoring, and saved results that stay connected to the setup that produced them.
That matters because testing an AI agent is not only about whether it answered one prompt correctly. You also need to know whether it stays correct across many inputs, remains robust when prompts vary, holds its safety boundaries under pressure, and improves over time instead of regressing silently.
What Spec27 helps teams do
You can use Spec27 to:
- organize evaluation work inside shared Projects
- store test cases, expected outputs, and categories in reusable Datasets
- save runnable agent logic as Agents
- add project Secrets when runtime access depends on protected values
- configure Judges when outputs need interpretation instead of exact matching
- define reusable Specifications that say what should be tested and how
- save Evals as named, repeatable, runnable setups
- inspect Results, logs, and run history over time
The key idea is repeatability. You can define a dataset once, save a specification once, run the same eval again later, and check whether an updated agent actually got better.
How the product works in practice
Spec27 organizes testing as a chain of reusable assets rather than a loose collection of prompts.
Organization -> Project -> Datasets / Agents / Secrets / Judges -> Specification -> Eval -> Run -> Results
Each part of that chain has a clear role:
- the Organization is the shared workspace for access, roles, and usage
- the Project groups one evaluation stream and its assets together
- Datasets define the test inputs and expected behavior
- Agents define the runnable system you want to evaluate
- Secrets support runtime access when the agent needs protected values
- Judges provide scoring when correctness is too nuanced for exact matching
- a Specification defines the evaluation recipe
- an Eval saves a runnable setup that connects agents and specifications
- a Run is one execution of that setup
- Results preserve the outputs, scores, errors, and logs for later review
This structure improves testing quality because it keeps the inputs, scoring logic, execution setup, and outcomes tied together. Instead of losing context after a quick experiment, you can see exactly what was tested, how it was scored, and what changed between runs.
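To make the chain concrete, here is a minimal sketch of how the assets could relate to one another as plain data. It is an illustration only, not Spec27's data model or API: the class names, fields, and the `results` method are all assumptions, and Organization, Project, Secrets, and Judges are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Case:
    """One dataset row: an input and the behavior expected for it (hypothetical)."""
    prompt: str
    expected: str


@dataclass(frozen=True)
class Specification:
    """The evaluation recipe: which cases to run and how to score them (hypothetical)."""
    name: str
    dataset: tuple[Case, ...]
    scoring: str  # assumed label, e.g. "strict_equality"


@dataclass(frozen=True)
class Eval:
    """A named, runnable setup that connects an agent to a specification (hypothetical)."""
    name: str
    agent_name: str
    specification: Specification


@dataclass
class Run:
    """One execution of an Eval, holding the outputs the agent produced (hypothetical)."""
    eval_setup: Eval
    outputs: list[str] = field(default_factory=list)

    def results(self) -> list[dict]:
        # Each result row carries the names of the eval and specification
        # that produced it, so the outcome stays tied to its setup.
        spec = self.eval_setup.specification
        return [
            {
                "eval": self.eval_setup.name,
                "specification": spec.name,
                "prompt": case.prompt,
                "expected": case.expected,
                "output": output,
                "correct": output == case.expected,
            }
            for case, output in zip(spec.dataset, self.outputs)
        ]
```

Because every result row names the eval and specification behind it, a later reader can tell what was tested and how it was scored without reconstructing the experiment.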
How Spec27 improves testing for AI agents
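Spec27 makes agent testing more systematic in five ways: structured datasets, robustness testing, reusable specifications, explicit scoring, and preserved evidence.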
Move from single prompts to structured datasets
A few hand-picked prompts can tell you whether something looks promising, but they are weak evidence. Spec27 lets you build datasets that represent the behaviors you actually care about, then reuse those datasets across runs.
That makes testing broader and more honest. Instead of asking, "Did it answer this one prompt well?", you can ask, "How did it perform across the full set of behaviors we care about?"
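As a rough illustration of what "across the full set of behaviors" can mean, the sketch below computes accuracy per category over a small set of scored rows. The row layout (`category`, `expected`, `output`) and the function name are assumptions for this example, not a Spec27 export format.

```python
from collections import defaultdict


def accuracy_by_category(rows):
    """Return the fraction of exactly matching outputs for each category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in rows:
        totals[row["category"]] += 1
        # Exact matching is assumed here; other scoring methods would plug in at this point.
        correct[row["category"]] += int(row["output"] == row["expected"])
    return {category: correct[category] / totals[category] for category in totals}


rows = [
    {"category": "refunds", "expected": "30 days", "output": "30 days"},
    {"category": "refunds", "expected": "No", "output": "Yes"},
    {"category": "shipping", "expected": "Yes", "output": "Yes"},
]
print(accuracy_by_category(rows))  # {'refunds': 0.5, 'shipping': 1.0}
```

A per-category breakdown shows where an agent is weak even when the overall score looks acceptable.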
Test robustness, not just happy-path correctness
Real users do not always type ideal prompts. They paraphrase, make typos, add distracting context, or ask things in unfamiliar formats. Spec27 supports robustness testing through reusable specifications, attack methods, and adversarial coverage.
That helps you learn whether an agent is only correct in the cleanest case or whether it still behaves correctly under realistic variation.
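The idea of realistic variation can be sketched with two simple perturbations: an adjacent-character typo and a texting-style rewrite. These functions are hypothetical stand-ins written for this example. They are not Spec27's attack methods, and real perturbation methods are likely to be more sophisticated.

```python
import random


def swap_typo(text, rng):
    """Swap one randomly chosen pair of adjacent characters to imitate a typo."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def texting_style(text):
    """Replace a few common words with texting abbreviations (assumed mapping)."""
    replacements = {"you": "u", "please": "pls", "are": "r"}
    return " ".join(replacements.get(word.lower(), word) for word in text.split())


def variants(prompt, seed=0):
    """Return the clean prompt plus two perturbed versions, reproducibly."""
    rng = random.Random(seed)  # a fixed seed keeps reruns comparable
    return [prompt, swap_typo(prompt, rng), texting_style(prompt)]
```

Running an agent on every variant of every dataset prompt, and scoring each against the same expected output, shows whether correctness survives the kind of noise real users introduce.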
Reuse the same specification across agent versions
When you update prompts, switch models, or change agent logic, you need to know whether the change actually improved behavior. Spec27 makes that easier by separating the reusable evaluation recipe from the runnable setup.
You can reuse the same specification with updated agents and compare outcomes across versions instead of rebuilding the test manually every time.
Use explicit scoring instead of vague impressions
Some tasks need exact matching. Some allow multiple correct outputs. Some require a judge to interpret whether the answer met the standard. Spec27 supports strict equality, permitted values, and judge-based scoring so the evaluation method fits the task.
That produces clearer evidence than "this looked good when I tried it."
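The three scoring styles differ mainly in what counts as a pass. The sketch below shows one possible reading of each. The function names, the threshold parameter, and the keyword-based judge are assumptions for illustration; a real judge would typically be a model or a richer rubric, and Spec27's own scoring behavior may differ.

```python
from typing import Callable, Iterable


def strict_equality(output: str, expected: str) -> bool:
    """Pass only when the output matches the expected text exactly."""
    return output == expected


def permitted_values(output: str, allowed: Iterable[str]) -> bool:
    """Pass when the output is any one of several acceptable answers."""
    return output in set(allowed)


def judge_based(output: str, judge: Callable[[str], float], threshold: float = 0.5) -> bool:
    """Pass when a judge's score for the output reaches the threshold."""
    return judge(output) >= threshold


def keyword_judge(output: str) -> float:
    """Toy stand-in for a judge: scores 1.0 if the answer mentions a refund."""
    return 1.0 if "refund" in output.lower() else 0.0
```

For example, `strict_equality("30 days", "30 days")` passes, `permitted_values("thirty days", ["30 days", "thirty days"])` passes, and `judge_based("We offer a full refund.", keyword_judge)` passes, while a free-form answer that never mentions a refund fails the toy judge.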
Preserve evidence, failure context, and results over time
Spec27 keeps runs attached to their evals and specifications. Results can include outputs, correctness, judge details, logs, statuses, and exportable data.
That gives you something reviewable and repeatable. If a failure appears, you can investigate it. If you make a change, you can rerun the same setup and see whether the problem was fixed.
What kinds of evaluation work Spec27 supports
Gold Team evaluations
Gold Team work focuses on desirable behavior, correctness, and robustness. In these evaluations, you are usually asking whether the agent does the right thing for legitimate tasks.
That can include testing:
- whether an agent returns the correct answer
- whether it stays correct across phrasing changes
- whether typos or formatting noise cause failures
- whether the agent remains stable when inputs are paraphrased or slightly harder
Gold Team work is useful when you want to improve product quality, reduce regressions, and build confidence that the agent works well for normal users.
Red Team evaluations
Red Team work focuses on misuse, harmfulness, jailbreaks, sensitive-information exposure, and other failures that you deliberately try to provoke. In these evaluations, you are testing whether an agent can be pushed past the behavior boundary it is supposed to hold.
That can include testing:
- whether the agent refuses unsafe requests
- whether jailbreak-style prompting changes its behavior
- whether roleplay or obfuscation causes unsafe compliance
- whether the system leaks protected or sensitive information
Red Team work is useful when you want to understand where an agent is vulnerable, which attack styles are most effective, and what needs mitigation before release.
Concrete examples
Example: Gold Team support-agent testing
Suppose you have a support agent that answers policy questions. You create a dataset of legitimate user questions, define the expected outputs, and run the same evaluation after each prompt or model update.
Then you add robustness methods such as paraphrasing, texting-style prompts, or typo-based perturbations. This helps you see whether the agent is only good on clean examples or whether it still performs well when real users ask the same question in messy ways.
Example: Red Team refusal-boundary testing
Suppose you have an assistant that should refuse harmful or disallowed requests. You create a red-team specification, use adversarial methods such as roleplay or jailbreak-style prompting, and score the outputs with a judge.
This helps you see whether the refusal behavior is actually strong or whether it only works against direct prompts and breaks under adversarial framing.
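One way to summarize such a run is to measure, per adversarial method, how often the agent failed to refuse. The sketch below assumes each record is a `(method, output)` pair and uses a crude prefix check as a stand-in for the judge. Both the record shape and the refusal check are assumptions for this example, not Spec27 behavior.

```python
from collections import Counter


def is_refusal(output):
    """Toy stand-in for a judge: treat common refusal openers as a refusal."""
    markers = ("i can't", "i cannot", "i won't")
    return output.lower().startswith(markers)


def bypass_rate_by_method(records):
    """Return, for each method, the fraction of attempts that were not refused."""
    attempts = Counter()
    bypasses = Counter()
    for method, output in records:
        attempts[method] += 1
        if not is_refusal(output):
            bypasses[method] += 1
    return {method: bypasses[method] / attempts[method] for method in attempts}


records = [
    ("direct", "I can't help with that."),
    ("roleplay", "Sure, here is what you asked for."),
    ("roleplay", "I cannot do that."),
]
print(bypass_rate_by_method(records))  # {'direct': 0.0, 'roleplay': 0.5}
```

A gap between the direct and adversarial rates is the signal described above: the refusal holds against plain requests but weakens under framing.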
Example: Iteration after an agent change
You update an agent's prompt, tools, or underlying model. Instead of redoing the evaluation informally, you rerun the same eval in Spec27 and compare the new results with the previous run.
This helps you answer a concrete question: did the agent improve, regress, or simply change behavior in a way that needs more investigation?
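The comparison itself can be sketched as a per-case diff between two runs. Here each run is assumed to be a mapping from a case id to an `(output, correct)` pair. That shape and the function name are hypothetical and chosen only to illustrate the three outcomes named above.

```python
def compare_runs(previous, current):
    """Classify each shared case as improved, regressed, changed, or same."""
    report = {"improved": [], "regressed": [], "changed": [], "same": []}
    # Only cases present in both runs can be compared.
    for case_id in sorted(previous.keys() & current.keys()):
        old_output, old_ok = previous[case_id]
        new_output, new_ok = current[case_id]
        if new_ok and not old_ok:
            report["improved"].append(case_id)
        elif old_ok and not new_ok:
            report["regressed"].append(case_id)
        elif new_output != old_output:
            # Same verdict but different text: worth a closer look.
            report["changed"].append(case_id)
        else:
            report["same"].append(case_id)
    return report


previous = {"q1": ("30 days", True), "q2": ("No", False), "q3": ("Yes", True), "q4": ("Maybe", False)}
current = {"q1": ("14 days", False), "q2": ("Yes", True), "q3": ("Yes", True), "q4": ("Unsure", False)}
print(compare_runs(previous, current))
# {'improved': ['q2'], 'regressed': ['q1'], 'changed': ['q4'], 'same': ['q3']}
```

The "changed" bucket captures cases whose verdict did not move but whose output did, which is the "needs more investigation" outcome.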