Quickstart

Use this guide when you want to create your own project and build a complete evaluation workflow from scratch.

This walkthrough follows the same product flow as the onboarding example, but instead of inspecting preloaded assets, you will create your own.

Before You Begin

  • You can sign in to Spec27.
  • You belong to an organization.
  • You know the behavior you want to test.

Create a New Project

  1. Open Projects.
  2. Select Add Project or New project.
  3. On the project creation page, choose Start blank.
  4. Enter a project name.
  5. Add a short description that explains what the project is for.
  6. Choose the project visibility.
  7. Create the project.

After the project opens, use the left navigation menu to move between the main project areas:

  • Datasets
  • Agents
  • Secrets
  • Judges
  • Specs
  • Evals
  • Results

The project overview page is the main control point for the project. It helps you confirm what assets already exist and what still needs to be created.

Choose Your First Workflow

For a first project, the simplest path is usually a Gold Team workflow.

That means you will:

  • create a gold-team dataset with valid user inputs
  • define the expected outputs
  • connect an agent
  • choose an evaluation method
  • create a specification and eval
  • run the eval and review the results

If your goal is failure-seeking or misuse testing instead, follow the same flow but choose Red Team wherever the app asks for a team type.

Create and Populate a Dataset

  1. In the left navigation, open Datasets.
  2. Select Add Dataset.
  3. Choose the team type for the dataset:
    • Gold Team for expected and desirable behavior
    • Red Team for misuse, failure modes, or attack-oriented examples
  4. Enter a dataset name.
  5. Add a description.
  6. Confirm the dataset classification fields.
  7. Create the dataset.

After the dataset is created, open the dataset detail page and populate it.

From the dataset detail page, use:

  • Add entry to add entries one at a time
  • Bulk import if you already have source material in CSV form

For a typical gold-team dataset, each entry should include:

  • the user input
  • the expected output
  • an optional category if you want grouped analysis later

For a first project, keep the dataset small and focused. A short set of representative examples is easier to review and improve than a large mixed dataset.
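If your examples start out as objects in a script, the sketch below shows one way to turn them into CSV text for Bulk import. The column names input, expected_output, and category are assumptions made for illustration, as are the sample entries; check the Bulk import dialog for the headers Spec27 actually expects.

```javascript
// Hypothetical column names; the headers Bulk import expects may differ.
const COLUMNS = ["input", "expected_output", "category"];

// Quote a field when it contains a comma, a double quote, or a line break.
function escapeCsvField(value) {
  const text = String(value ?? "");
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build CSV text: one header row, then one row per dataset entry.
function toCsv(entries) {
  const header = COLUMNS.join(",");
  const rows = entries.map((entry) =>
    COLUMNS.map((column) => escapeCsvField(entry[column])).join(",")
  );
  return [header, ...rows].join("\n");
}

// Illustrative gold-team entries (made up for this example).
const sampleEntries = [
  {
    input: "What is your refund window?",
    expected_output: "30 days from delivery.",
    category: "billing",
  },
  {
    input: 'Do you ship to "PO boxes"?',
    expected_output: "Yes, within the US.",
    category: "",
  },
];

console.log(toCsv(sampleEntries));
```

Quoting fields this way keeps commas and quotation marks inside an input or expected output from being read as column separators.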

Review the Dataset Structure

After adding entries, review the dataset detail page.

Use it to confirm:

  • the dataset team type
  • the dataset kind
  • the number of entries
  • the entry table contents
  • whether any adversarial datasets exist yet

If the dataset is a primary dataset, its detail page also lists related adversarial datasets once they exist.

Create an Agent

  1. In the left navigation, open Agents.
  2. Create a new agent.
  3. Give the agent a clear name.
  4. Add the agent code content.
  5. Save the agent.

Agent content is written in JavaScript as a small REST client, following the pattern:

return async function process(input) {
  // Send the dataset input to your REST endpoint.
  const response = await fetch("https://your-service.example.com/respond", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ input }),
  });

  // Read the response and return the final output that Spec27 should evaluate.
  const data = await response.json();
  return data.output;
}

This keeps the agent focused on one job:

  • receive the dataset input
  • send that input to your REST endpoint
  • read the response
  • return the final output that Spec27 should evaluate

When you write the agent, make sure:

  • the returned function accepts the input you expect from the dataset
  • the request body matches your API contract
  • the returned value is the final text or result you want scored
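One way to check these points outside Spec27 is a small local harness. The sketch below is not part of the product: createAgent and stubFetch are hypothetical helpers. It mirrors the agent pattern, with fetch passed in as a parameter so it can be stubbed, and it adds a status check before parsing the response, which is an assumption about how you may want HTTP errors handled.

```javascript
// Hypothetical local harness; nothing here is a Spec27 API.
function createAgent(fetchImpl) {
  // Same shape as the agent pattern, with fetch injected for testing.
  return async function process(input) {
    const response = await fetchImpl("https://your-service.example.com/respond", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ input }),
    });

    // Assumption: treat non-2xx responses as errors rather than parsing them.
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }

    const data = await response.json();
    return data.output;
  };
}

// Stub standing in for the real endpoint: it echoes the input it receives.
async function stubFetch(url, options) {
  const { input } = JSON.parse(options.body);
  return {
    ok: true,
    status: 200,
    json: async () => ({ output: `echo: ${input}` }),
  };
}

const localAgent = createAgent(stubFetch);
localAgent("hello").then((output) => console.log(output)); // logs "echo: hello"
```

Replacing stubFetch with the global fetch sends the same request to your real endpoint, which is a quick way to confirm that the request body matches your API contract.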

After the agent is created, open the agent detail page and review:

  • the current agent content
  • the rate-limit configuration
  • whether the agent is already used in any evals

If you want to test the agent before wiring it into an eval, open Preview from the agent page.

Create a Specification

  1. In the left navigation, open Specs.
  2. Select Add Specification.
  3. Choose the team type for the specification.
  4. Enter a specification name.
  5. Add any additional context if the workflow needs it.
  6. Select the primary dataset.
  7. Review the evaluation settings shown in the form.
  8. Save the specification.

The specification form changes based on the selected team type.

For a Gold Team specification, you typically choose:

  • the primary dataset
  • the evaluation method
  • the execution mode
  • permitted values, if you use that evaluation method
  • a judge in the specification form, if you use judge-based evaluation
  • any attack methods or adversarial dataset selections

For a Red Team specification:

  • the evaluation method is fixed to judge-based evaluation
  • a baseline judge is created and assigned automatically
  • the form focuses on the primary dataset and attack coverage

You do not need to create judges separately before creating a specification. Add or confirm judge usage from the specification form when the workflow requires judge-based scoring.

After saving the specification, review its detail page and confirm that the status is progressing normally. A specification may move through states such as Preparing, Ready, or Failed.

Create an Eval

  1. In the left navigation, open Evals.
  2. Select Add Eval.
  3. Enter an eval name.
  4. Add a description.
  5. Select one or more agents.
  6. Select one or more specifications.
  7. Save the eval.

After the eval is created, open the eval detail page and review:

  • the assigned agents
  • the assigned specifications
  • the description
  • the schedule section
  • the results summary

At this stage, the eval becomes the saved workflow that connects your assets into something you can run again later.

Run the Eval

Open the eval results area when you are ready to execute the workflow.

You can do this in either of these ways:

  1. Open the eval detail page and select View results.
  2. Open Results from the project navigation and find the eval there.

From the eval results area, use Run eval to start a run.

After the run starts, monitor:

  • the current status
  • clean accuracy
  • robust accuracy, when adversarial coverage exists
  • the run history

Review the Results

Open the latest run to inspect the detailed outcome.

Use the run detail page to review:

  • whether the run completed successfully
  • which cases passed or failed
  • the output for each dataset entry
  • any judge explanations or votes
  • any logs or errors
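If you added categories to your dataset entries, a per-category breakdown makes it easier to see where failures cluster. The sketch below assumes you have copied pass and fail outcomes into a plain array; the shape of each item (category and passed) is hypothetical and not a Spec27 export format.

```javascript
// Hypothetical result shape: { category: string, passed: boolean }.
function passRateByCategory(caseResults) {
  const totals = new Map();
  for (const { category, passed } of caseResults) {
    // Entries without a category are grouped under one shared label.
    const key = category || "uncategorized";
    const entry = totals.get(key) || { passed: 0, total: 0 };
    entry.total += 1;
    if (passed) entry.passed += 1;
    totals.set(key, entry);
  }

  const summary = {};
  for (const [key, { passed, total }] of totals) {
    summary[key] = { passed, total, passRate: passed / total };
  }
  return summary;
}

// Illustrative outcomes (made up for this example).
const sampleCaseResults = [
  { category: "billing", passed: true },
  { category: "billing", passed: false },
  { category: "shipping", passed: true },
  { category: "", passed: true },
];

console.log(passRateByCategory(sampleCaseResults));
```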

The results views are also where you compare clean and robust performance over time.

Clean performance, reported as clean accuracy, is the standard accuracy on the primary dataset.

Robust performance, reported as robust accuracy, is the percentage of primary examples that remained correct across all of their adversarial variants.
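The two definitions can be sketched as a small calculation. The data shape below is hypothetical: each primary example records whether its clean output was correct and whether each adversarial variant was correct. The sketch assumes that a robust example must also be correct on the clean input, and that an example with no variants counts as robust when its clean output is correct; Spec27 may treat these edge cases differently.

```javascript
// Hypothetical shape: { cleanCorrect: boolean, variantsCorrect: boolean[] }.

// Fraction of primary examples answered correctly on the unmodified input.
function cleanAccuracy(examples) {
  if (examples.length === 0) return 0;
  const correct = examples.filter((example) => example.cleanCorrect).length;
  return correct / examples.length;
}

// Fraction of primary examples that are correct on the clean input
// and on every one of their adversarial variants.
function robustAccuracy(examples) {
  if (examples.length === 0) return 0;
  const robust = examples.filter(
    (example) => example.cleanCorrect && example.variantsCorrect.every(Boolean)
  ).length;
  return robust / examples.length;
}

// Illustrative run (made up for this example).
const sampleExamples = [
  { cleanCorrect: true, variantsCorrect: [true, true] },
  { cleanCorrect: true, variantsCorrect: [true, false] },
  { cleanCorrect: false, variantsCorrect: [false, false] },
  { cleanCorrect: true, variantsCorrect: [true] },
];

console.log(cleanAccuracy(sampleExamples)); // 0.75
console.log(robustAccuracy(sampleExamples)); // 0.5
```

Under these assumptions, robust accuracy can never exceed clean accuracy, because a single failed variant removes an example from the robust count.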