
What are Experiments?

Experiments let you systematically test different AI configurations against a baseline to find what works best. Instead of guessing which model, prompt, or parameter setting will perform better, you run controlled comparisons across a fixed set of tasks and let the results speak for themselves. Think of it as A/B testing for your AI pipeline — change one variable at a time, measure the impact, and make data-driven decisions about your configuration.

Why use Experiments?

| Benefit | What it unlocks |
| --- | --- |
| Data-driven optimization | Stop guessing — measure which configuration actually performs better |
| Controlled comparisons | Test one variable at a time against a fixed baseline for clear results |
| Safe iteration | Try new models, prompts, or parameters without affecting production |
| Team collaboration | Share experiment results with stakeholders to justify configuration changes |

Key Concepts

Experiment

An experiment is a container that groups variants for comparison. Each experiment is scoped to specific inference steps and tasks, ensuring all variants are tested under the same conditions.

Baseline

The baseline is your starting point — typically your current production configuration. All other variants are compared against it. Once the baseline runs, its configuration and task set are locked for the duration of the experiment.

Variant

A variant represents a specific configuration change you want to test. Each variant modifies one parameter for one inference step, making it easy to isolate what caused any change in performance. Variants can also branch from other variants to explore incremental improvements.

Inference Step

An inference step is a named, configurable point in your AI pipeline — for example, a specific LLM call like response_generator or classifier. When you create an experiment, you select which inference steps are in scope for testing.
Inference steps are identified by their externalId, which matches the step names in your application code and webhook payloads.

Tasks

Tasks are the test cases that each variant runs against. By using the same set of tasks for every variant, you get a fair comparison. Tasks are selected when setting up the experiment and locked once the baseline starts running.

How Experiments Work

1. Configure Experiment
   Create a new experiment with a name and description. Select the inference steps you want to test — these are the configurable points in your AI pipeline that variants will override.

2. Select Tasks
   Choose which tasks to include in the experiment. These tasks will be used to evaluate every variant, ensuring consistent comparison conditions.

3. Configure and Run Baseline
   Set your baseline configuration — this is typically your current production setup. Run the baseline to establish your starting metrics. Once the baseline completes, the task set and parameter scope are locked.

4. Create and Compare Variants
   Create variants that change a single parameter (e.g., a different temperature, model, or system prompt). Each variant runs against the same tasks, and results are compared to the baseline automatically.

Experiment Lifecycle

An experiment progresses through several stages:
| Status | Description |
| --- | --- |
| Draft | Initial setup — configuring name, inference steps, and tasks |
| Baseline Pending | Tasks selected, ready to configure and run the baseline |
| Running Baseline | Baseline variant is executing against selected tasks |
| Baseline Complete | Baseline has results; ready to create and run variants |
| Baseline Failed | Baseline execution failed; review task configuration and retry |
| Running Variant | A variant is currently being tested |
| Idle | Waiting for you to create or run the next variant |
| Completed | All testing is done; review results and pick a winner |
| Archived | Experiment has been archived and is no longer active |

Variant Branching

Variants form a tree structure rooted at the baseline. You can:
  • Branch from the baseline to test a completely different value for a parameter
  • Branch from another variant to make incremental changes on top of a previous experiment
Each variant applies a config patch — a partial override that changes specific parameters for specific inference steps. The effective configuration is computed by walking the chain from the baseline through each parent, overlaying patches in order. For example, if Variant A changes temperature to 0.3 and Variant B (branched from A) changes model to gpt-4o, then Variant B’s effective config includes both changes.
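The patch-overlay logic described above can be sketched as follows. This is a minimal illustration, not Avido's actual data model: the `patch`/`parent` structure is hypothetical, and per-step keying is omitted so a patch applies to a single inference step's parameters.

```python
def effective_config(variant, baseline_config):
    """Compute a variant's effective config by walking the chain from
    the baseline through each parent, overlaying patches in order."""
    # Collect patches from this variant back up to the baseline
    chain = []
    node = variant
    while node is not None:
        chain.append(node["patch"])
        node = node.get("parent")
    chain.reverse()  # apply the oldest patch first

    config = dict(baseline_config)
    for patch in chain:
        config.update(patch)  # later patches win on conflicting keys
    return config


baseline = {"model": "gpt-4o-mini", "temperature": 0.7}
variant_a = {"patch": {"temperature": 0.3}, "parent": None}
variant_b = {"patch": {"model": "gpt-4o"}, "parent": variant_a}

print(effective_config(variant_b, baseline))
# {'model': 'gpt-4o', 'temperature': 0.3}
```

Variant B's effective config includes both its own model change and Variant A's temperature change, matching the example above.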

Configurable Parameters

Each variant can override one of these parameters per inference step:
| Parameter | Description | Example |
| --- | --- | --- |
| system | The system message / instructions | "You are a concise assistant." |
| temperature | Randomness of the output | 0.3 |
| top_p | Nucleus sampling threshold | 0.9 |
| max_tokens | Maximum response length | 500 |
| model | The LLM model to use | gpt-4o |
| verbosity | Output detail level | low, medium, high |
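A variant's config patch can be pictured as a mapping keyed by inference step name, with one overridden parameter inside. The shape below is inferred from the webhook overrides format and is illustrative, not an official schema:

```python
# Hypothetical patch for a single variant: lowers temperature for the
# response_generator step; all other steps keep their baseline values.
patch = {
    "response_generator": {
        "temperature": 0.3,
    }
}
```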

Understanding Results

After each variant completes, you’ll see:
  • Pass Rate — percentage of tasks that passed (0–100%)
  • Pass Rate vs Baseline — relative change compared to the baseline
  • Task Breakdown — total, passed, and failed task counts
  • Completed At — when the variant finished running
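The pass-rate metrics can be computed as below. Note the "vs Baseline" figure here is taken as a percentage-point difference; Avido may report it as a relative percentage instead, so treat the exact formula as an assumption:

```python
def pass_rate(passed, total):
    """Percentage of tasks that passed (0-100)."""
    return 100.0 * passed / total


# Example counts (illustrative): baseline passed 14/20, variant passed 17/20
baseline_rate = pass_rate(14, 20)   # 70.0
variant_rate = pass_rate(17, 20)    # 85.0

# Percentage-point difference vs the baseline (assumed definition)
delta = variant_rate - baseline_rate  # +15.0 points
```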
When the experiment is complete, Avido highlights the best-performing variant so you can quickly identify the winning configuration.

Webhook Integration

When a test runs as part of an experiment, the webhook payload includes an experiment field with configuration overrides for each inference step. Your application applies these overrides before running the LLM call.
Example webhook payload with experiment:

```json
{
  "prompt": "Write a concise onboarding email.",
  "testId": "123e4567-e89b-12d3-a456-426614174000",
  "experiment": {
    "experimentId": "aaa11111-bbbb-cccc-dddd-eeeeeeeeeeee",
    "experimentVariantId": "fff22222-3333-4444-5555-666666666666",
    "overrides": {
      "response_generator": {
        "temperature": 0.3,
        "system": "You are a concise assistant."
      }
    }
  }
}
```
The overrides object is keyed by inference step name. Each value contains the parameter overrides to apply for that step.
You don’t need to change your trace ingestion code. The testId already links the trace back to the correct experiment variant. See the Webhooks guide for full implementation details.
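Applying the overrides on your side might look like the sketch below. The default-config lookup and the surrounding handler are placeholders for your own code; only the payload shape comes from the example above.

```python
def apply_experiment_overrides(payload, defaults):
    """Merge experiment overrides (keyed by inference step externalId)
    on top of each step's default configuration."""
    # Copy defaults so production config objects are never mutated
    configs = {step: dict(cfg) for step, cfg in defaults.items()}

    experiment = payload.get("experiment")
    if experiment:  # absent on normal (non-experiment) test runs
        for step, overrides in experiment["overrides"].items():
            configs.setdefault(step, {}).update(overrides)
    return configs


# Your app's default per-step configuration (placeholder values)
defaults = {"response_generator": {"model": "gpt-4o", "temperature": 0.7}}

payload = {
    "prompt": "Write a concise onboarding email.",
    "testId": "123e4567-e89b-12d3-a456-426614174000",
    "experiment": {
        "overrides": {
            "response_generator": {
                "temperature": 0.3,
                "system": "You are a concise assistant.",
            }
        }
    },
}

configs = apply_experiment_overrides(payload, defaults)
# configs["response_generator"] now keeps model "gpt-4o" but uses the
# overridden temperature and system message for this test run
```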

Best Practices

**Start with a clear hypothesis**
Before creating a variant, write down what you expect to happen. For example: “Lowering temperature from 0.7 to 0.3 will reduce hallucinations and improve the pass rate.”

**Change one variable at a time**
Isolating a single parameter per variant makes it easy to attribute any performance change to that specific modification.

**Use enough tasks**
The more tasks you include, the more statistically meaningful your results will be. A small task set may produce noisy results.

**Branch to iterate**
If a variant shows promise, branch from it to test further refinements rather than starting over from the baseline.

**Document your findings**
Use the experiment description field to record your hypothesis, and review the results to confirm or refute it. This builds institutional knowledge about what works for your AI system.

Getting Started

  1. Navigate to Experiments in your application dashboard
  2. Click New Experiment and give it a name and description
  3. Select the inference steps you want to test
  4. Choose tasks to evaluate against
  5. Configure and run your baseline
  6. Create variants to test different configurations
  7. Compare results and identify the best-performing setup

Webhooks

Learn how experiment overrides are delivered to your application

Evaluations

Understand how evaluations determine pass/fail for experiment tasks