What are Experiments?
Experiments let you systematically test different AI configurations against a baseline to find what works best. Instead of guessing which model, prompt, or parameter setting will perform better, you run controlled comparisons across a fixed set of tasks and let the results speak for themselves. Think of it as A/B testing for your AI pipeline — change one variable at a time, measure the impact, and make data-driven decisions about your configuration.
Why use Experiments?
| Benefit | What it unlocks |
|---|---|
| Data-driven optimization | Stop guessing — measure which configuration actually performs better |
| Controlled comparisons | Test one variable at a time against a fixed baseline for clear results |
| Safe iteration | Try new models, prompts, or parameters without affecting production |
| Team collaboration | Share experiment results with stakeholders to justify configuration changes |
Key Concepts
Experiment
An experiment is a container that groups variants for comparison. Each experiment is scoped to specific inference steps and tasks, ensuring all variants are tested under the same conditions.
Baseline
The baseline is your starting point — typically your current production configuration. All other variants are compared against it. Once the baseline runs, its configuration and task set are locked for the duration of the experiment.
Variant
A variant represents a specific configuration change you want to test. Each variant modifies one parameter for one inference step, making it easy to isolate what caused any change in performance. Variants can also branch from other variants to explore incremental improvements.
Inference Step
An inference step is a named, configurable point in your AI pipeline — for example, a specific LLM call like response_generator or classifier. When you create an experiment, you select which inference steps are in scope for testing.
Inference steps are identified by their externalId, which matches the step names in your application code and webhook payloads.
Tasks
Tasks are the test cases that each variant runs against. By using the same set of tasks for every variant, you get a fair comparison. Tasks are selected when setting up the experiment and locked once the baseline starts running.
How Experiments Work
Configure Experiment
Create a new experiment with a name and description. Select the inference steps you want to test — these are the configurable points in your AI pipeline that variants will override.
Select Tasks
Choose which tasks to include in the experiment. These tasks will be used to evaluate every variant, ensuring consistent comparison conditions.
Configure and Run Baseline
Set your baseline configuration — this is typically your current production setup. Run the baseline to establish your starting metrics. Once the baseline completes, the task set and parameter scope are locked.
Experiment Lifecycle
An experiment progresses through several stages:
| Status | Description |
|---|---|
| Draft | Initial setup — configuring name, inference steps, and tasks |
| Baseline Pending | Tasks selected, ready to configure and run the baseline |
| Running Baseline | Baseline variant is executing against selected tasks |
| Baseline Complete | Baseline has results; ready to create and run variants |
| Baseline Failed | Baseline execution failed; review task configuration and retry |
| Running Variant | A variant is currently being tested |
| Idle | Waiting for you to create or run the next variant |
| Completed | All testing is done; review results and pick a winner |
| Archived | Experiment has been archived and is no longer active |
Variant Branching
Variants form a tree structure rooted at the baseline. You can:
- Branch from the baseline to test a completely different value for a parameter
- Branch from another variant to make incremental changes on top of a previous experiment
For example, if Variant A changes temperature to 0.3 and Variant B (branched from A) changes model to gpt-4o, then Variant B’s effective config includes both changes.
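The merge described above can be sketched as a walk from the baseline down to the selected variant, layering each variant's single override on top of its ancestors'. This is an illustrative sketch, not the product's actual data model; the dict-with-parent structure and function name are assumptions:

```python
# Illustrative variant tree: each node holds its parent and the
# overrides it introduces. Shapes and values are assumptions.
baseline = {"name": "baseline", "parent": None,
            "overrides": {"model": "gpt-4o-mini", "temperature": 0.7}}
variant_a = {"name": "A", "parent": baseline,
             "overrides": {"temperature": 0.3}}
variant_b = {"name": "B", "parent": variant_a,
             "overrides": {"model": "gpt-4o"}}

def effective_config(variant):
    """Resolve a variant's effective config by merging overrides
    from the baseline down, so descendants win over ancestors."""
    chain = []
    node = variant
    while node is not None:          # collect variant -> ... -> baseline
        chain.append(node)
        node = node["parent"]
    config = {}
    for node in reversed(chain):     # apply baseline first, variant last
        config.update(node["overrides"])
    return config

print(effective_config(variant_b))
# {'model': 'gpt-4o', 'temperature': 0.3}
```

Because Variant B's override (model) is applied after Variant A's (temperature), the effective config carries both changes, matching the branching behavior described above.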
Configurable Parameters
Each variant can override one of these parameters per inference step:
| Parameter | Description | Example |
|---|---|---|
| system | The system message / instructions | "You are a concise assistant." |
| temperature | Randomness of the output | 0.3 |
| top_p | Nucleus sampling threshold | 0.9 |
| max_tokens | Maximum response length | 500 |
| model | The LLM model to use | gpt-4o |
| verbosity | Output detail level | low, medium, high |
Understanding Results
After each variant completes, you’ll see:
- Pass Rate — percentage of tasks that passed (0–100%)
- Pass Rate vs Baseline — relative change compared to the baseline
- Task Breakdown — total, passed, and failed task counts
- Completed At — when the variant finished running
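These metrics can be derived directly from per-task outcomes. A minimal sketch, assuming "vs Baseline" is the percentage-point difference in pass rate (the function and field names are illustrative, not the product's API):

```python
def summarize(outcomes, baseline_pass_rate):
    """Summarize a variant's run from a list of per-task booleans.

    outcomes: True for each passed task, False for each failed one.
    baseline_pass_rate: the baseline's pass rate, as a 0-100 percentage.
    """
    passed = sum(outcomes)
    total = len(outcomes)
    pass_rate = 100.0 * passed / total
    return {
        "pass_rate": pass_rate,                        # 0-100%
        "vs_baseline": pass_rate - baseline_pass_rate, # percentage points
        "total": total,
        "passed": passed,
        "failed": total - passed,
    }

# 3 of 4 tasks passed, against a 50% baseline:
summary = summarize([True, True, True, False], baseline_pass_rate=50.0)
print(summary["pass_rate"], summary["vs_baseline"])
# 75.0 25.0
```

As the note on task counts below suggests, small totals make these percentages jumpy: with only four tasks, a single flipped outcome moves the pass rate by 25 points.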
Webhook Integration
When a test runs as part of an experiment, the webhook payload includes an experiment field with configuration overrides for each inference step. Your application applies these overrides before running the LLM call.
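A payload carrying experiment overrides might look like the following sketch. Apart from the experiment field, the overrides object, the testId, and the parameter and step names already described in this guide, the exact structure is an assumption:

```json
{
  "testId": "test_123",
  "experiment": {
    "overrides": {
      "response_generator": {
        "temperature": 0.3
      },
      "classifier": {
        "model": "gpt-4o"
      }
    }
  }
}
```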
The overrides object is keyed by inference step name. Each value contains the parameter overrides to apply for that step.
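Applying the overrides on the application side amounts to a shallow merge of each step's defaults with any overrides the payload carries for that step. A sketch under those assumptions (the function name, defaults, and payload shape are illustrative):

```python
def params_for_step(step_name, defaults, payload):
    """Merge experiment overrides for one inference step into the
    step's default LLM parameters. Override values win over defaults;
    a payload with no experiment field leaves the defaults untouched."""
    overrides = (payload.get("experiment", {})
                        .get("overrides", {})
                        .get(step_name, {}))
    return {**defaults, **overrides}

defaults = {"model": "gpt-4o-mini", "temperature": 0.7}
payload = {
    "testId": "test_123",
    "experiment": {"overrides": {"response_generator": {"temperature": 0.3}}},
}

print(params_for_step("response_generator", defaults, payload))
# {'model': 'gpt-4o-mini', 'temperature': 0.3}
```

For steps not named in the overrides object (and for regular, non-experiment payloads), the merge is a no-op, so the same code path can serve production and experiment traffic.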
You don’t need to change your trace ingestion code. The testId already links the trace back to the correct experiment variant. See the Webhooks guide for full implementation details.
Best Practices
Start with a clear hypothesis
Before creating a variant, write down what you expect to happen. For example: “Lowering temperature from 0.7 to 0.3 will reduce hallucinations and improve the pass rate.”
Change one variable at a time
Isolating a single parameter per variant makes it easy to attribute any performance change to that specific modification.
Use enough tasks
The more tasks you include, the more statistically meaningful your results will be. A small task set may produce noisy results.
Branch to iterate
If a variant shows promise, branch from it to test further refinements rather than starting over from the baseline.
Document your findings
Use the experiment description field to record your hypothesis, and review the results to confirm or refute it. This builds institutional knowledge about what works for your AI system.
Getting Started
- Navigate to Experiments in your application dashboard
- Click New Experiment and give it a name and description
- Select the inference steps you want to test
- Choose tasks to evaluate against
- Configure and run your baseline
- Create variants to test different configurations
- Compare results and identify the best-performing setup
Webhooks
Learn how experiment overrides are delivered to your application
Evaluations
Understand how evaluations determine pass/fail for experiment tasks