Mental Model

Spec27 is easiest to understand as a chain of reusable assets.

If the product feels broad at first, focus on one core idea: you create assets once, connect them into an evaluation workflow, run that workflow, and review the results later.

Organization

The organization is the top-level shared workspace. This is where memberships, roles, invite links, and usage limits live.

Project

The project is the main container for one evaluation stream. It holds the assets and results that belong together.

Assets inside a project

  • Datasets hold the test entries and expected behavior you want to evaluate.
  • Agents hold the runnable logic you want to test.
  • Secrets provide protected runtime values when an agent needs them.
  • Judges score outputs when exact matching is not enough.
  • Specifications define the reusable evaluation recipe.
  • Evals define the reusable runnable setup that a run executes.

Runs and results

  • A run is a single execution of an eval or playground workflow.
  • Results are the recorded outputs, scores, statuses, and logs produced by that run.
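
To make the relationship between a run and its results concrete, here is a minimal Python sketch. It is a mental-model aid, not Spec27's actual data model or API: the class names, field names, and status values ("passed", "failed") are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Result:
    """One recorded outcome for a single test entry (hypothetical shape)."""
    entry_id: str
    output: str
    score: float
    status: str  # assumed values, e.g. "passed" or "failed"
    logs: List[str] = field(default_factory=list)


@dataclass
class Run:
    """A single execution of an eval or playground workflow."""
    run_id: str
    source: str  # assumed values: "eval" or "playground"
    results: List[Result] = field(default_factory=list)

    def status_counts(self) -> Dict[str, int]:
        # Count how many results ended in each status.
        return dict(Counter(r.status for r in self.results))

    def mean_score(self) -> Optional[float]:
        # Average score across all results, or None for an empty run.
        if not self.results:
            return None
        return sum(r.score for r in self.results) / len(self.results)


run = Run(
    run_id="run-1",
    source="eval",
    results=[
        Result("e1", "Paris", 1.0, "passed"),
        Result("e2", "Lyon", 0.0, "failed", logs=["expected Paris"]),
    ],
)
print(run.status_counts(), run.mean_score())
```

The point of the sketch is that a run owns its results. Reviewing results later means reading what one specific run recorded, not re-executing anything.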

The key idea: specifications are reusable

A specification defines what should be tested. It can include:

  • a primary dataset
  • one or more attack methods
  • one or more adversarial dataset selections
  • an optional judge-based evaluation setup

You can reuse the same specification across multiple evals. This makes it easier to test different agents against the same evaluation recipe or rerun the same checks after you change an agent.
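
The reuse described above can be sketched in a few lines of Python. This is an illustration only, not the Spec27 API: `Specification` and `Eval` are hypothetical classes, the idea that an eval pairs one specification with one agent is an assumption drawn from the paragraph above, and every name such as `prompt-injection` or `agent-v1` is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Specification:
    """What should be tested (hypothetical model of the fields listed above)."""
    name: str
    primary_dataset: str
    attack_methods: Tuple[str, ...] = ()
    adversarial_datasets: Tuple[str, ...] = ()
    judge: Optional[str] = None  # optional judge-based evaluation setup


@dataclass(frozen=True)
class Eval:
    """A runnable setup. Pairing a specification with an agent is an assumption."""
    name: str
    specification: Specification
    agent: str


# Define the evaluation recipe once.
spec = Specification(
    name="support-bot-checks",
    primary_dataset="support-questions",
    attack_methods=("prompt-injection",),
    judge="helpfulness-judge",
)

# Reuse the same recipe for two different agents.
evals = [
    Eval(name=f"{spec.name}/{agent}", specification=spec, agent=agent)
    for agent in ("agent-v1", "agent-v2")
]

# Every eval points at the very same specification object.
shared = all(e.specification is spec for e in evals)
print(shared)
```

Because both evals reference one specification, their results stay comparable: any difference between them comes from the agent, not from a change in the recipe.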

Gold Team and Red Team use the same workflow

  • Gold Team specifications focus on desirable behavior, correctness, and robustness.
  • Red Team specifications focus on misuse, harmfulness, jailbreaks, and failure-seeking evaluation.
  • Both flows move through the same reusable chain of project assets, specifications, evals, runs, and results.

The workflow at a glance

Organization -> Project -> Datasets / Agents / Secrets / Judges -> Specification -> Eval -> Run -> Results
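
If it helps to see the chain as data, the sketch below encodes the line above as ordered stages in Python and lists what sits upstream of any stage. It reads each arrow only as "comes before". Whether every upstream asset is strictly required (for example, Secrets or Judges) is not something this sketch claims, and `WORKFLOW` and `upstream_of` are hypothetical names, not part of Spec27.

```python
from typing import List, Tuple

# Each tuple is one step in the chain; assets grouped in the same step
# live side by side inside a project.
WORKFLOW: List[Tuple[str, ...]] = [
    ("Organization",),
    ("Project",),
    ("Datasets", "Agents", "Secrets", "Judges"),
    ("Specification",),
    ("Eval",),
    ("Run",),
    ("Results",),
]


def upstream_of(stage: str) -> List[str]:
    """Return every stage that appears earlier in the chain than `stage`."""
    earlier: List[str] = []
    for group in WORKFLOW:
        if stage in group:
            return earlier
        earlier.extend(group)
    raise ValueError(f"unknown stage: {stage}")


print(upstream_of("Specification"))
```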