What are Experiments?
Experiments let you systematically test different AI configurations against a baseline to find what works best. Instead of guessing which model, prompt, or parameter setting will perform better, you run controlled comparisons across a fixed set of tasks and let the results speak for themselves. Think of it as A/B testing for your AI pipeline — change one variable at a time, measure the impact, and make data-driven decisions about your configuration.
Why use Experiments?
| Benefit | What it unlocks |
|---|---|
| Data-driven optimization | Stop guessing — measure which configuration actually performs better |
| Controlled comparisons | Test one variable at a time against a fixed baseline for clear results |
| Safe iteration | Try new models, prompts, or parameters without affecting production |
| Team collaboration | Share experiment results with stakeholders to justify configuration changes |
Key Concepts
Experiment
An experiment is a container that groups variants for comparison. Each experiment is scoped to specific inference steps and tasks, ensuring all variants are tested under the same conditions.
Baseline
The baseline is your starting point — typically your current production configuration. All other variants are compared against it. Once the baseline runs, its configuration and task set are locked for the duration of the experiment.
Variant
A variant represents a specific configuration change you want to test. Each variant modifies one parameter for one inference step, making it easy to isolate what caused any change in performance. Variants can also branch from other variants to explore incremental improvements.
Inference Step
An inference step is a named, configurable point in your AI pipeline — for example, a specific LLM call like response_generator or classifier. When you create an experiment, you select which inference steps are in scope for testing.
Inference steps are identified by their externalId, which matches the step names in your application code and webhook payloads.
Tasks
Tasks are the test cases that each variant runs against. By using the same set of tasks for every variant, you get a fair comparison. Tasks are selected when setting up the experiment and locked once the baseline starts running.
How Experiments Work
Configure Experiment
Create a new experiment with a name and description. Select the inference steps you want to test — these are the configurable points in your AI pipeline that variants will override.
Select Tasks
Choose which tasks to include in the experiment. These tasks will be used to evaluate every variant, ensuring consistent comparison conditions.
Configure and Run Baseline
Set your baseline configuration — this is typically your current production setup. Run the baseline to establish your starting metrics. Once the baseline completes, the task set and parameter scope are locked.
Experiment Lifecycle
An experiment progresses through several stages:
| Status | Description |
|---|---|
| Draft | Initial setup — configuring name, inference steps, and tasks |
| Baseline Pending | Tasks selected, ready to configure and run the baseline |
| Running Baseline | Baseline variant is executing against selected tasks |
| Baseline Complete | Baseline has results; ready to create and run variants |
| Baseline Failed | Baseline execution failed; review task configuration and retry |
| Running Variant | A variant is currently being tested |
| Idle | Waiting for you to create or run the next variant |
| Completed | All testing is done; review results and pick a winner |
| Archived | Experiment has been archived and is no longer active |
Variant Branching
Variants form a tree structure rooted at the baseline. You can:
- Branch from the baseline to test a completely different value for a parameter
- Branch from another variant to make incremental changes on top of a previous experiment
For example, if Variant A changes temperature to 0.3 and Variant B (branched from A) changes model to gpt-4o, then Variant B’s effective config includes both changes.
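The merge described above can be sketched as a walk from the baseline down to the selected variant, layering each variant's single override on top of its ancestors'. This is an illustrative sketch, not the product's actual data model; the dict-with-parent structure and function name are assumptions:

```python
# Illustrative variant tree: each node holds its parent and the
# overrides it introduces. Shapes and values are assumptions.
baseline = {"name": "baseline", "parent": None,
            "overrides": {"model": "gpt-4o-mini", "temperature": 0.7}}
variant_a = {"name": "A", "parent": baseline,
             "overrides": {"temperature": 0.3}}
variant_b = {"name": "B", "parent": variant_a,
             "overrides": {"model": "gpt-4o"}}

def effective_config(variant):
    """Resolve a variant's effective config by merging overrides
    from the baseline down, so descendants win over ancestors."""
    chain = []
    node = variant
    while node is not None:          # collect variant -> ... -> baseline
        chain.append(node)
        node = node["parent"]
    config = {}
    for node in reversed(chain):     # apply baseline first, variant last
        config.update(node["overrides"])
    return config

print(effective_config(variant_b))
# {'model': 'gpt-4o', 'temperature': 0.3}
```

Because Variant B's override (model) is applied after Variant A's (temperature), the effective config carries both changes, matching the branching behavior described above.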
Configurable Parameters
Each variant can override one of these parameters per inference step:
| Parameter | Description | Example |
|---|---|---|
| system | The system message / instructions | "You are a concise assistant." |
| temperature | Randomness of the output | 0.3 |
| top_p | Nucleus sampling threshold | 0.9 |
| max_tokens | Maximum response length | 500 |
| model | The LLM model to use | gpt-4o |
| verbosity | Output detail level | low, medium, high |
Understanding Results
After each variant completes, you’ll see:
- Pass Rate — percentage of tasks that passed (0–100%)
- Pass Rate vs Baseline — relative change compared to the baseline
- Task Breakdown — total, passed, and failed task counts
- Completed At — when the variant finished running
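These metrics can be derived directly from per-task outcomes. A minimal sketch, assuming "vs Baseline" is the percentage-point difference in pass rate (the function and field names are illustrative, not the product's API):

```python
def summarize(outcomes, baseline_pass_rate):
    """Summarize a variant's run from a list of per-task booleans.

    outcomes: True for each passed task, False for each failed one.
    baseline_pass_rate: the baseline's pass rate, as a 0-100 percentage.
    """
    passed = sum(outcomes)
    total = len(outcomes)
    pass_rate = 100.0 * passed / total
    return {
        "pass_rate": pass_rate,                        # 0-100%
        "vs_baseline": pass_rate - baseline_pass_rate, # percentage points
        "total": total,
        "passed": passed,
        "failed": total - passed,
    }

# 3 of 4 tasks passed, against a 50% baseline:
summary = summarize([True, True, True, False], baseline_pass_rate=50.0)
print(summary["pass_rate"], summary["vs_baseline"])
# 75.0 25.0
```

As the note on task counts below suggests, small totals make these percentages jumpy: with only four tasks, a single flipped outcome moves the pass rate by 25 points.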
Webhook Integration
When a test runs as part of an experiment, the webhook payload includes an experiment field with configuration overrides for each inference step. Your application applies these overrides before running the LLM call.
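A payload carrying experiment overrides might look like the following sketch. Apart from the experiment field, the overrides object, the testId, and the parameter and step names already described in this guide, the exact structure is an assumption:

```json
{
  "testId": "test_123",
  "experiment": {
    "overrides": {
      "response_generator": {
        "temperature": 0.3
      },
      "classifier": {
        "model": "gpt-4o"
      }
    }
  }
}
```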
The overrides object is keyed by inference step name. Each value contains the parameter overrides to apply for that step.
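Applying the overrides on the application side amounts to a shallow merge of each step's defaults with any overrides the payload carries for that step. A sketch under those assumptions (the function name, defaults, and payload shape are illustrative):

```python
def params_for_step(step_name, defaults, payload):
    """Merge experiment overrides for one inference step into the
    step's default LLM parameters. Override values win over defaults;
    a payload with no experiment field leaves the defaults untouched."""
    overrides = (payload.get("experiment", {})
                        .get("overrides", {})
                        .get(step_name, {}))
    return {**defaults, **overrides}

defaults = {"model": "gpt-4o-mini", "temperature": 0.7}
payload = {
    "testId": "test_123",
    "experiment": {"overrides": {"response_generator": {"temperature": 0.3}}},
}

print(params_for_step("response_generator", defaults, payload))
# {'model': 'gpt-4o-mini', 'temperature': 0.3}
```

For steps not named in the overrides object (and for regular, non-experiment payloads), the merge is a no-op, so the same code path can serve production and experiment traffic.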
You don’t need to change your trace ingestion code. The testId already links the trace back to the correct experiment variant. See the Webhooks guide for full implementation details.
Best Practices
Start with a clear hypothesis
Before creating a variant, write down what you expect to happen. For example: “Lowering temperature from 0.7 to 0.3 will reduce hallucinations and improve the pass rate.”
Change one variable at a time
Isolating a single parameter per variant makes it easy to attribute any performance change to that specific modification.
Use enough tasks
The more tasks you include, the more statistically meaningful your results will be. A small task set may produce noisy results.
Branch to iterate
If a variant shows promise, branch from it to test further refinements rather than starting over from the baseline.
Document your findings
Use the experiment description field to record your hypothesis, and review the results to confirm or refute it. This builds institutional knowledge about what works for your AI system.
Getting Started
- Navigate to Experiments in your application dashboard
- Click New Experiment and give it a name and description
- Select the inference steps you want to test
- Choose tasks to evaluate against
- Configure and run your baseline
- Create variants to test different configurations
- Compare results and identify the best-performing setup
Webhooks
Learn how experiment overrides are delivered to your application
Evaluations
Understand how evaluations determine pass/fail for experiment tasks