Glossary
Agent
Runnable logic that takes an input and produces an output to be evaluated.
Dataset
A collection of test entries, examples, or rules used during evaluation.
Dataset category
An optional label on a dataset entry used to group related cases for later analysis.
Eval
A named evaluation setup that connects agents and specifications.
Judge
A scoring configuration used when correctness requires interpretation.
Organization
The top-level shared workspace for members, projects, and collaboration.
Playground
A fast, exploratory run surface for trying configurations and generating derivative datasets.
Project
The main container for assets and results.
Red Team
Evaluation work focused on misuse, harmfulness, jailbreaks, and failure-seeking behavior.
Result
A recorded output, score, or status produced during a run.
Run
A single execution of an eval or playground workflow.
Secret
A protected project-level value used by an agent at runtime.
Specification
A reusable evaluation recipe that defines the datasets, attack methods, and related scoring setup for an eval.
Gold Team
Evaluation work focused on desirable behavior, correctness, and robustness.
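
The relationships between several of these terms may be easier to see in code. The following is a minimal Python sketch, not the product's actual API: every class, field, and function name here (DatasetEntry, Specification, Result, Eval, mean_score_by_category) is hypothetical and chosen only to mirror the glossary entries. It assumes an agent is a plain function from text to text and that a scoring setup can be reduced to a function from an output to a number.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# All names below are hypothetical illustrations of the glossary terms,
# not identifiers from any real API.


@dataclass
class DatasetEntry:
    text: str
    category: Optional[str] = None  # optional "dataset category" label


@dataclass
class Specification:
    name: str
    dataset: List[DatasetEntry]
    # Stand-in for a scoring setup: maps an agent output to a score.
    score: Callable[[str], float]


@dataclass
class Result:
    entry: DatasetEntry
    output: str
    score: float


@dataclass
class Eval:
    name: str
    agent: Callable[[str], str]  # runnable logic that processes an input
    specifications: List[Specification]

    def run(self) -> List[Result]:
        """One run: call the agent on every entry of every specification."""
        results = []
        for spec in self.specifications:
            for entry in spec.dataset:
                output = self.agent(entry.text)
                results.append(Result(entry, output, spec.score(output)))
        return results


def mean_score_by_category(results: List[Result]) -> Dict[str, float]:
    """Group results by dataset category and average the scores."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.entry.category or "uncategorized"].append(result.score)
    return {cat: sum(scores) / len(scores) for cat, scores in grouped.items()}


# Example: a toy agent that upper-cases its input, scored on whether
# the output starts with "H".
spec = Specification(
    name="starts-with-h",
    dataset=[
        DatasetEntry("hi", category="greeting"),
        DatasetEntry("bye", category="greeting"),
        DatasetEntry("2+2"),
    ],
    score=lambda output: 1.0 if output.startswith("H") else 0.0,
)
demo_eval = Eval(name="demo", agent=lambda text: text.upper(), specifications=[spec])
results = demo_eval.run()
summary = mean_score_by_category(results)
print(summary)  # {'greeting': 0.5, 'uncategorized': 0.0}
```

In this sketch the category label plays no part in the run itself; it is only read afterwards, when results are grouped for analysis, which matches the glossary's description of a dataset category.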