Glossary

Agent

Runnable logic that takes an input and processes it; the thing an eval exercises.

Dataset

A collection of test entries, examples, or rules used during evaluation.

Dataset category

An optional label on a dataset entry used to group related cases for later analysis.

Eval

A named evaluation setup that connects agents and specifications.

Judge

A scoring configuration used when correctness requires interpretation.

Organization

The top-level shared workspace for members, projects, and collaboration.

Playground

A lightweight surface for quick, exploratory runs, used to try out configurations and generate derivative datasets.

Project

The main container for assets and results.

Result

A recorded output, score, or status produced during a run.

Red Team

Evaluation work focused on misuse, harmfulness, jailbreaks, and failure-seeking behavior.

Run

A single execution of an eval or playground workflow.

Secret

A protected project-level value used by an agent at runtime.

Specification

A reusable evaluation recipe that defines the datasets, attack methods, and related scoring setup for an eval.

Gold Team

Evaluation work focused on desirable behavior, correctness, and robustness.
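
The relationships among several of these terms (agent, dataset, dataset category, specification, eval, run, result) can be sketched as a small data model. The Python below is an illustrative sketch only, not the product's API: every class, field, and function name (DatasetEntry, Specification.attack_methods, Eval.run, group_by_category, the "completed" status string) is hypothetical, and the nested-loop run logic is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DatasetEntry:
    prompt: str
    category: Optional[str] = None  # optional "dataset category" label (hypothetical field name)


@dataclass
class Dataset:
    name: str
    entries: List[DatasetEntry] = field(default_factory=list)


@dataclass
class Specification:
    # Reusable recipe: which datasets and attack methods an eval uses.
    name: str
    datasets: List[Dataset]
    attack_methods: List[str] = field(default_factory=list)


@dataclass
class Result:
    entry: DatasetEntry
    output: str
    status: str


# An agent is modeled here as any callable from input text to output text.
Agent = Callable[[str], str]


@dataclass
class Eval:
    # Named setup connecting agents and specifications.
    name: str
    agents: List[Agent]
    specifications: List[Specification]

    def run(self) -> List[Result]:
        """One run: send every dataset entry through every agent (assumed behavior)."""
        results: List[Result] = []
        for agent in self.agents:
            for spec in self.specifications:
                for dataset in spec.datasets:
                    for entry in dataset.entries:
                        results.append(Result(entry, agent(entry.prompt), "completed"))
        return results


def group_by_category(results: List[Result]) -> Dict[str, List[Result]]:
    """Group results by the entry's category label for later analysis."""
    grouped: Dict[str, List[Result]] = {}
    for result in results:
        grouped.setdefault(result.entry.category or "uncategorized", []).append(result)
    return grouped
```

In this sketch, a run of an Eval with one agent over a two-entry dataset yields two Result records, which group_by_category then buckets by the optional category label.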