Quickstart

This guide walks you through a complete evaluation from scratch: create a project, add an agent, prepare a specification with test entries, run an eval, and read the results. It uses a Gold Team workflow, which is the simplest place to start.

Before you begin

You can sign in to Spec27 and belong to an organization.
You know the behaviour you want to test.

1. Create a project

Open Projects and add a new project.
Give it a name, a short description, and a visibility setting.
Create the project.

The project's left navigation is where you move between Agents, Specifications, Evals, and Results.

2. Create an agent

An agent connects your system to Spec27. You have three options — pick whichever fits:

Agent Builder — describe the agent and have Spec27 generate the code.
Registry integrations — copy a prebuilt integration (OpenAI, Gemini, LangGraph, Botpress, and more) and supply your credentials.
Write the code yourself — author a small JavaScript client in the editor.

Declare any secrets the agent needs, then Preview it with a sample input to confirm it works before going further.
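As a rough illustration of the third option, a hand-written client can be little more than one function that forwards an input to your system and returns the reply as text. The sketch below is not Spec27's actual agent interface: the `createAgent` factory, the `respond` method, the `API_KEY` secret name, the endpoint URL, and the `{ input }` / `{ output }` payload shape are all hypothetical, so check the editor for the real contract before adapting it.

```javascript
// Hypothetical agent client. The function names, secret name, URL, and
// payload shape are illustrative assumptions, not Spec27's documented API.
function createAgent({ endpoint, secrets, fetchImpl = globalThis.fetch }) {
  return {
    // Send one input to the system under test and return its text output.
    async respond(input) {
      const res = await fetchImpl(endpoint, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${secrets.API_KEY}`,
        },
        body: JSON.stringify({ input }),
      });
      if (!res.ok) {
        throw new Error(`Agent request failed with status ${res.status}`);
      }
      const data = await res.json();
      return String(data.output);
    },
  };
}

// Stand-in for a real endpoint so the sketch can be tried locally.
async function fakeFetch(url, options) {
  const { input } = JSON.parse(options.body);
  return {
    ok: true,
    status: 200,
    json: async () => ({ output: input.toUpperCase() }),
  };
}

const agent = createAgent({
  endpoint: "https://example.invalid/chat", // placeholder URL
  secrets: { API_KEY: "test-key" }, // placeholder secret value
  fetchImpl: fakeFetch,
});

// Rough equivalent of previewing the agent with a sample input.
const previewPromise = agent.respond("hello");
```

Passing the fetch function in as a parameter keeps the client easy to try against a stub, which mirrors the idea of previewing with a sample input before running a full eval.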

3. Create a specification

The specification is where the evaluation comes together. You'll author test entries (inputs and expected outputs), choose your scoring method, and optionally add robustness checks.

Open Specifications and add one, choosing Gold Team.
Name it and create it.
In the Entries tab, add test entries with Add entry (or Bulk import if you already have cases). Each entry needs an Input text and an Expected output; add a Category if you want grouped analysis later. Keep the first set small and representative. To expand coverage, you can generate entries using attack methods.
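If you are drafting cases outside the UI before importing them, it can help to keep them as plain data and check them first. The sketch below assumes a simple object shape whose field names (`input`, `expected`, `category`) mirror the form labels. These names and the sample entries are made up for illustration, and this is not a documented Bulk import format.

```javascript
// Hypothetical entry shape: field names mirror the form labels
// (Input text, Expected output, Category) but are assumptions.
const entries = [
  { input: "What is the capital of France?", expected: "Paris", category: "geography" },
  { input: "What is 2 + 2?", expected: "4", category: "arithmetic" },
  { input: "Is water wet? Answer yes or no.", expected: "yes" }, // category is optional
];

// Return a list of problems so they can be fixed before import.
function validateEntries(list) {
  const problems = [];
  list.forEach((entry, index) => {
    if (typeof entry.input !== "string" || entry.input.trim() === "") {
      problems.push(`entry ${index}: missing input`);
    }
    if (typeof entry.expected !== "string" || entry.expected.trim() === "") {
      problems.push(`entry ${index}: missing expected output`);
    }
  });
  return problems;
}

// Count entries per category to check that the set is representative.
function countByCategory(list) {
  const counts = {};
  for (const entry of list) {
    const key = entry.category ?? "(uncategorised)";
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```

Calling `validateEntries(entries)` on the sample set returns an empty list, and `countByCategory(entries)` shows at a glance whether one category dominates a small first set.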
In the Evaluation tab, choose your Evaluation method:
- Strict equality — output must match the expected output exactly.
- Permitted values — output must be one of a list you provide.
- Judge — an LLM judge scores the output (see evaluation methods).
Optionally add attack methods to measure robustness. Leaving them unchecked runs only the primary entries.
Save the specification and wait for its status to reach Ready.
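To make the difference between the three evaluation methods concrete, here is a minimal sketch of how each could decide pass or fail. It is an illustration under assumptions, not Spec27's scoring code: the function names are hypothetical, the product's handling of whitespace or case is not specified in this guide, and the judge is replaced by a toy function where a real one would call an LLM.

```javascript
// Illustrative scoring functions; the real comparison rules
// (for example whitespace or case handling) are not specified here.
function strictEquality(output, expected) {
  return output === expected;
}

function permittedValues(output, allowed) {
  return allowed.includes(output);
}

// A judge delegates the decision to another function. In practice that
// would call an LLM; here `judgeFn` is a hypothetical stand-in.
function judge(output, expected, judgeFn) {
  const { pass, explanation } = judgeFn(output, expected);
  return { pass: Boolean(pass), explanation };
}

// Toy judge: passes if the expected text appears anywhere in the output.
function containsJudge(output, expected) {
  const pass = output.toLowerCase().includes(expected.toLowerCase());
  return {
    pass,
    explanation: pass ? "expected text found" : "expected text missing",
  };
}
```

With these definitions, an output of "The capital is Paris." fails strict equality against "Paris" but passes the toy judge, which is the kind of case where a judge-based method is worth considering.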

4. Create and run an eval

Open Evals and add an eval. Give it a name, then select your Agents and Specifications under Coverage.
Save the eval, then select Run eval from the eval detail page.
Watch the run status while it executes.

5. Review the results

Open the run to inspect the outcome. For each entry you can see whether it passed or failed, the actual output, and any judge explanation. Two headline numbers summarise the run:

Clean accuracy — performance on the primary entries.
Robust accuracy — the share of primary entries that stay correct across all of their adversarial variants (shown when the spec has attack coverage).
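The two numbers can be sketched as simple ratios over per-entry results. The example below assumes a hypothetical result shape (`passed` for the primary entry, `variants` for its adversarial variants) and assumes that an entry counts as robust only when the primary entry and every variant pass. Spec27's exact calculation may differ.

```javascript
// Each result: { passed: boolean, variants: boolean[] }, where `variants`
// holds the pass/fail outcome of that entry's adversarial variants.
// This shape is a hypothetical stand-in for real run data.
function cleanAccuracy(results) {
  if (results.length === 0) return 0;
  const correct = results.filter((r) => r.passed).length;
  return correct / results.length;
}

// Assumption: an entry is robust only if the primary entry passed
// and every one of its variants passed too.
function robustAccuracy(results) {
  if (results.length === 0) return 0;
  const robust = results.filter(
    (r) => r.passed && r.variants.every(Boolean)
  ).length;
  return robust / results.length;
}

// Made-up results for four primary entries with two variants each.
const results = [
  { passed: true, variants: [true, true] },
  { passed: true, variants: [true, false] },
  { passed: false, variants: [false, false] },
  { passed: true, variants: [true, true] },
];

console.log(cleanAccuracy(results)); // 0.75
console.log(robustAccuracy(results)); // 0.5
```

Under this definition robust accuracy can never exceed clean accuracy, so a large gap between the two points to entries that pass only in their original wording.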

See Clean vs robust performance for how to read the two together, and Robustness for the trend across runs.

Where to next

Testing behaviour over a conversation? See Goal-based multi-turn evaluation.
Probing for misuse? See Red team evaluations.
