Runs and Results

Use Results to understand what happened during a run and what to change next.

Open a run

Open a run from the eval detail page (View results) or from the project Results area. While a run is in progress, the page shows its status and current step; you can leave and come back as it completes.

Read the headline metrics

A completed run summarises performance with two numbers:

Clean accuracy — performance on the primary entries, shown with the count behind it (for example "X / Y correct").
Robust accuracy — the share of primary entries that stayed correct across all of their adversarial variants. It appears once the specification has attack coverage.

See Clean vs robust performance for how to interpret the gap between them, and Robustness for the trend across runs.
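
As a rough illustration of how the two headline numbers relate, the sketch below computes both from a list of per-entry outcomes. The `PrimaryEntry` record and its field names are hypothetical, not the product's data model. It also assumes that the robust denominator is every primary entry, that an entry with no adversarial variants counts as robust when its primary is correct, and that the robust figure is withheld until at least one entry has variants.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PrimaryEntry:
    # Hypothetical record: one primary entry plus the outcomes of its variants.
    entry_id: str
    primary_correct: bool
    variant_correct: list[bool] = field(default_factory=list)


def clean_accuracy(entries: list[PrimaryEntry]) -> tuple[int, int]:
    """Return (correct, total) over the primary entries."""
    correct = sum(1 for e in entries if e.primary_correct)
    return correct, len(entries)


def robust_accuracy(entries: list[PrimaryEntry]) -> Optional[tuple[int, int]]:
    """Return (robust, total), or None when no entry has adversarial variants.

    Assumption: an entry is robust when its primary is correct and every
    one of its adversarial variants is also correct.
    """
    if not any(e.variant_correct for e in entries):
        return None  # no attack coverage yet, so nothing to report
    robust = sum(
        1 for e in entries if e.primary_correct and all(e.variant_correct)
    )
    return robust, len(entries)


entries = [
    PrimaryEntry("a", True, [True, True]),   # correct and robust
    PrimaryEntry("b", True, [True, False]),  # correct, broken by one variant
    PrimaryEntry("c", False, [False]),       # incorrect primary
    PrimaryEntry("d", True, [True]),         # correct and robust
]
print(clean_accuracy(entries))   # (3, 4)
print(robust_accuracy(entries))  # (2, 4)
```

Under these assumptions robust accuracy can never exceed clean accuracy, because an entry must first be correct on its primary input before it can count as robust.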

Inspect per-case results

The per-case table shows each entry's input, expected output, actual output, and a result badge:

Pass — scored fully correct.
Fail — scored incorrect.
Error (Agent) — the agent failed to produce an output.
Error (Judge) — scoring failed.

Select a case to open its details: the primary input (and the adversarial input for attacked rows), the expected and actual output, the correctness score, and, for judge-scored runs, the judge explanation. For goal-based multi-turn runs, the columns become Goal, Target, and Conversation trace, and the details show the full turn-by-turn exchange.
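
One way to think about the four badges is as a decision made in order: first whether the agent produced an output, then whether scoring succeeded, and only then whether the score counts as correct. The sketch below is a hypothetical model of that ordering, not the product's implementation. The function name, the use of `None` to signal a failure, and the rule that only a score of 1.0 is "fully correct" are all assumptions.

```python
from enum import Enum
from typing import Optional


class Badge(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    ERROR_AGENT = "Error (Agent)"
    ERROR_JUDGE = "Error (Judge)"


def badge_for(actual_output: Optional[str], score: Optional[float]) -> Badge:
    """Map one case to a badge (hypothetical helper).

    Assumptions: a missing output means the agent failed, a missing score
    means scoring failed, and only a score of 1.0 is fully correct.
    """
    if actual_output is None:
        return Badge.ERROR_AGENT  # nothing to score
    if score is None:
        return Badge.ERROR_JUDGE  # output exists, but scoring failed
    return Badge.PASS if score >= 1.0 else Badge.FAIL


print(badge_for("Paris", 1.0).value)  # Pass
print(badge_for("Lyon", 0.0).value)   # Fail
print(badge_for(None, None).value)    # Error (Agent)
print(badge_for("Paris", None).value) # Error (Judge)
```

Separating the two error badges from Fail matters when reading a run: an error says the case was never properly evaluated, while a Fail says it was evaluated and scored incorrect.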

Filter to what matters

Narrow the table to focus on the cases you care about:

by attack method,
by category (when your entries use categories),
by result — All, Non robust cases, Failed primaries, or Failures only, and
by searching input and output text.
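
If you work with exported results, the same narrowing can be reproduced offline. The sketch below combines the filters listed above, applying every filter that is set. The `CaseRow` record and its field names are hypothetical. Because the exact rules behind the result presets are not spelled out here, the sketch takes an explicit set of badges instead of guessing at them, and it assumes that text search is case-insensitive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseRow:
    # Hypothetical row shape; field names are illustrative only.
    input_text: str
    actual_output: str
    badge: str
    attack_method: Optional[str] = None  # None for primary (unattacked) rows
    category: Optional[str] = None


def filter_cases(
    rows: list[CaseRow],
    attack_method: Optional[str] = None,
    category: Optional[str] = None,
    badges: Optional[set[str]] = None,
    search: Optional[str] = None,
) -> list[CaseRow]:
    """Keep rows that satisfy every filter that is set (logical AND)."""
    needle = search.lower() if search else None
    kept = []
    for row in rows:
        if attack_method is not None and row.attack_method != attack_method:
            continue
        if category is not None and row.category != category:
            continue
        if badges is not None and row.badge not in badges:
            continue
        if needle is not None:
            haystack = f"{row.input_text}\n{row.actual_output}".lower()
            if needle not in haystack:
                continue
        kept.append(row)
    return kept


rows = [
    CaseRow("Refund my order", "Refund issued", "Pass", None, "billing"),
    CaseRow("REFUND my 0rder!!", "I cannot help", "Fail", "typo", "billing"),
    CaseRow("Where is my parcel?", "In transit", "Pass", None, "shipping"),
    CaseRow("Where is my parcel??", "", "Error (Judge)", "typo", "shipping"),
]
failed_typos = filter_cases(rows, attack_method="typo", badges={"Fail"})
print([r.input_text for r in failed_typos])  # ['REFUND my 0rder!!']
```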

Decide what to do next

After reviewing, you should be able to tell whether the run succeeded, which cases failed, whether failures cluster around a category or attack method, and whether judge explanations point to a clear fix. Iterate on the agent or specification, then re-run and compare.
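
To check whether failures cluster, it can help to count them per group and compare failure rates. The sketch below does this for rows represented as plain dictionaries. The keys `badge`, `category`, and `attack_method` are hypothetical names for exported data, and only rows with the Fail badge are counted as failures (error rows are assumed to be excluded from the failure count but included in the totals).

```python
from collections import Counter


def failure_rates(rows: list[dict], key: str) -> list[tuple[str, int, int]]:
    """Return (group, failures, total) per group, highest failure rate first.

    Rows missing the key are grouped under "(none)".
    """
    totals: Counter = Counter()
    failures: Counter = Counter()
    for row in rows:
        group = row.get(key) or "(none)"
        totals[group] += 1
        if row["badge"] == "Fail":
            failures[group] += 1
    return sorted(
        ((g, failures[g], totals[g]) for g in totals),
        key=lambda item: item[1] / item[2],
        reverse=True,
    )


rows = [
    {"badge": "Fail", "category": "billing", "attack_method": "typo"},
    {"badge": "Fail", "category": "billing", "attack_method": None},
    {"badge": "Pass", "category": "billing", "attack_method": None},
    {"badge": "Pass", "category": "shipping", "attack_method": "typo"},
    {"badge": "Pass", "category": "shipping", "attack_method": None},
]
for group, failed, total in failure_rates(rows, "category"):
    print(f"{group}: {failed}/{total} failed")
```

A group with a clearly higher failure rate than the rest is a reasonable place to start reading judge explanations before changing the agent or the specification.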
