# Metrics

> Four numbers that tell you whether to trust your AI's quality — and what to do when you can't.

## Why four numbers, not one?

A pass rate on its own can mislead you.

* **80% from 5 tests** could really be anywhere between 30% and 99%. You don't have enough data to tell.
* **80% from 500 tests** is a real 80%.
* **80% with scores tightly clustered** is predictable.
* **80% with scores all over the place** will surprise you in production.

The pass rate doesn't change in any of these — but your willingness to ship should. Avido gives you four metrics so you can see the full story, not just the headline.

| Metric          | What it tells you                       |
| --------------- | --------------------------------------- |
| **Pass Rate**   | How often the task passes.              |
| **Evaluations** | How many times we've tested it.         |
| **Confidence**  | Whether you can trust the pass rate.    |
| **Stability**   | Whether individual runs are consistent. |

***

## The verdict

You don't have to read four numbers. Avido turns them into a one-line verdict at the top of every task.

| Confidence | Stability | Verdict                                                                                     |
| ---------- | --------- | ------------------------------------------------------------------------------------------- |
| 🟢 High    | 🟢 High   | **Trustworthy and consistent — ready to ship.**                                             |
| 🟢 High    | 🔴 Low    | **The average is reliable, but individual runs vary.** Expect surprises in production.      |
| 🔴 Low     | 🟢 High   | **Looks consistent so far, but not enough data to be confident yet.** Run more evaluations. |
| 🔴 Low     | 🔴 Low    | **Not enough signal.** Run more evaluations and investigate the variance.                   |

When either side is **🟡 Medium**, the verdict says so — *"Confidence is moderate"* or *"Stability is moderate"* — so you know exactly which number to focus on.
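
As a rough illustration, the verdict table can be read as a lookup keyed on the two badge levels. The sketch below is Python written for this page, not Avido's implementation: the function name `verdict`, the shortened verdict strings, and the way the two Medium notes are joined are all assumptions.

```python
# Hypothetical sketch of the verdict lookup. Strings are shortened
# paraphrases of the table above, not the exact product copy.
VERDICTS = {
    ("High", "High"): "Trustworthy and consistent: ready to ship.",
    ("High", "Low"): "The average is reliable, but individual runs vary.",
    ("Low", "High"): "Looks consistent so far, but not enough data to be confident yet.",
    ("Low", "Low"): "Not enough signal.",
}


def verdict(confidence: str, stability: str) -> str:
    """Return a one-line verdict for a (Confidence, Stability) pair."""
    # Assumption: when either side is Medium, the verdict names that side.
    notes = []
    if confidence == "Medium":
        notes.append("Confidence is moderate")
    if stability == "Medium":
        notes.append("Stability is moderate")
    if notes:
        return "; ".join(notes) + "."
    return VERDICTS[(confidence, stability)]


print(verdict("High", "Low"))
print(verdict("Medium", "High"))
```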

***

## What the badges mean

### Pass Rate

The percentage of evaluations that passed. This is the headline number, but it is only useful when Confidence is also High.

### Evaluations

How many tests have run. Avido looks at the last 30 days by default. More evaluations means more confidence in the pass rate.

### Confidence

*Can I trust this pass rate?*

| Badge         | Meaning                                                                               |
| ------------- | ------------------------------------------------------------------------------------- |
| 🟢 **High**   | Plenty of data. The pass rate is reliable.                                            |
| 🟡 **Medium** | Enough data to be useful, but the true pass rate could still drift from what you see. |
| 🔴 **Low**    | Too few evaluations to draw a real conclusion yet. Run more.                          |

When Confidence is Medium or Low, Avido tells you how many more evaluations you need to reach High — *"Run \~120 more evaluations to reach High confidence."*
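
This page does not say how that number is derived. One plausible approach, sketched below in Python, is to assume the observed pass rate holds and search for the smallest sample size at which the 95% Wilson margin of error (see "How the maths works") drops to ±5 percentage points. The function names, the fixed-pass-rate assumption, and the search cap are illustrative, not Avido's implementation.

```python
import math


def wilson_margin(pass_rate: float, total: int, z: float = 1.959964) -> float:
    """Half-width of the Wilson score interval, as a fraction (0.05 = 5 pp)."""
    denom = 1 + z * z / total
    spread = pass_rate * (1 - pass_rate) / total + z * z / (4 * total * total)
    return z * math.sqrt(spread) / denom


def evaluations_to_high(passes: int, total: int,
                        target: float = 0.05, cap: int = 1_000_000) -> int:
    """Extra evaluations needed for the margin to reach `target`.

    Assumption: the pass rate stays at passes / total as more runs arrive.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    rate = passes / total
    n = total
    while wilson_margin(rate, n) > target:
        n += 1
        if n > cap:
            raise RuntimeError("target margin not reached within cap")
    return n - total


# Example: how many more runs a task with 40 passes out of 50 might need.
print(evaluations_to_high(40, 50))
```

Because the real pass rate will move as new results arrive, a figure produced this way is an estimate, which fits the "\~" in the message.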

### Stability

*Are the individual runs consistent?*

| Badge         | Meaning                                                                                                |
| ------------- | ------------------------------------------------------------------------------------------------------ |
| 🟢 **High**   | Scores cluster tightly. Behaviour in production should match the average.                              |
| 🟡 **Medium** | Some variation between runs. Worth a look.                                                             |
| 🔴 **Low**    | Scores swing widely. The average looks fine, but individual users will get very different experiences. |

<Info>
  Stability needs at least two evaluations to compute. With fewer, you'll see `—` instead of a badge.
</Info>

***

## Why both matter

A high pass rate is necessary, but not enough. Confidence tells you whether the number is real. Stability tells you whether the **average** matches the **experience**.

A task at 85% pass rate with High Confidence and Low Stability means *the average customer interaction is fine — but a meaningful share of them won't be.* That's a different decision from "this is great, ship it."

***

## Where you see metrics

### On the task list

Every task shows Pass Rate, Confidence, Stability, and a combined **Health Score** as columns. Filter by any of them — for example, every task at Low Confidence — to find the ones that need more testing.

### Inside a task

Open any task and you'll see metrics at the top of the **Synthetic** and **Monitoring** tabs.

* **Synthetic** — results from your scheduled and manual tests (controlled test data).
* **Monitoring** — results from real production traffic.

Each tab has its own verdict and its own numbers. Avido doesn't combine them, because test data and live traffic are different populations — mixing them would give you a misleading average.

Below the verdict, a **per-evaluation breakdown** shows the same metrics for each evaluation on the task. That's where you spot which one is dragging the headline down — accuracy might be excellent while tone is volatile, or vice versa.

### The Explain drawer

Click any Confidence or Stability badge and an Explain drawer slides out:

1. **A plain-English answer** at the top.
2. **The inputs** — exactly what data the badge was computed from.
3. **Why this badge** — the threshold and where this task sits in it.
4. **What would change this** — how many more evaluations are needed, or where to investigate volatility.
5. **Show math** — for the curious, the formulas with your numbers filled in.

The drawer is for understanding, not configuration.

### On the Insights page

Three headline cards roll up across your whole application:

* **"X% of tasks have high confidence"** — how much of your suite you can trust today.
* **"N tasks need more testing"** — count of tasks at Low or Medium confidence.
* **"N tasks have unstable scores"** — count of tasks at Low stability.

Each card links straight through to a filtered task list.
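
The three cards are simple counts over per-task badges. A minimal Python sketch of that roll-up follows. The `TaskMetrics` class, its field names, and the rounding of the percentage are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    # Hypothetical record holding one task's badge levels.
    name: str
    confidence: str  # "High", "Medium", or "Low"
    stability: str   # "High", "Medium", or "Low"


def insights_cards(tasks: list) -> dict:
    """Compute the three headline figures from a list of TaskMetrics."""
    total = len(tasks)
    high = sum(t.confidence == "High" for t in tasks)
    return {
        # "X% of tasks have high confidence"
        "high_confidence_pct": round(100 * high / total) if total else 0,
        # "N tasks need more testing": Low or Medium confidence
        "need_more_testing": sum(t.confidence in ("Low", "Medium") for t in tasks),
        # "N tasks have unstable scores": Low stability only
        "unstable": sum(t.stability == "Low" for t in tasks),
    }


tasks = [
    TaskMetrics("refunds", "High", "High"),
    TaskMetrics("onboarding", "High", "Low"),
    TaskMetrics("billing", "Medium", "High"),
    TaskMetrics("escalation", "Low", "Low"),
]
print(insights_cards(tasks))
```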

***

## Health Score

For when you just want one number: Health Score blends Pass Rate, Confidence, and Stability into a single 0–100 score.

| Score  | Level     |
| ------ | --------- |
| 90–100 | 🟢 High   |
| 60–89  | 🟡 Medium |
| 0–59   | 🔴 Low    |

Pass Rate carries the most weight (60%). Confidence and Stability split the rest, 20% each.

<Info>
  If Confidence is Low, the Health Score shows as `—`. There's no point rolling up a pass rate that isn't trustworthy yet — run more evaluations and the score will reappear.
</Info>

***

## Common questions

**What if there's no data yet?**
The verdict and metrics stay hidden until evaluations exist. Add a [task](/evaluations) and run some tests.

**What if the pass rate is 100%?**
You can still have Low Confidence if it's based on only a handful of runs. The verdict will tell you to run more before celebrating.

**Why are Synthetic and Monitoring shown separately?**
Test data and real production traffic look different by design. Pooling them produces a number that doesn't mean much. Switch tabs to see each side cleanly.

**Why aren't experiments in here?**
[Experiments](/experiments) are controlled comparisons, not your baseline quality measurement. They live in their own area so they don't skew the headline numbers.

**Can I change the thresholds?**
Not in this version. They're tuned to work out of the box for most teams.

***

## How the maths works

You don't need to read this section. Avido has done the maths. It's here for the curious — and for compliance reviewers who want to verify the approach.

### Confidence

Confidence is the **margin of error around the pass rate**, computed as a 95% **Wilson score interval**. Wilson stays well-behaved at small sample sizes and at extreme pass rates (close to 0% or 100%). The simpler textbook formula (the normal approximation) collapses to a zero-width interval when every run passes or every run fails, so it falsely reports perfect confidence.

The badge comes from the size of the error bar:

| Margin of error                 | Level     |
| ------------------------------- | --------- |
| ≤ ±5 percentage points          | 🟢 High   |
| > ±5 to ≤ ±15 percentage points | 🟡 Medium |
| > ±15 percentage points         | 🔴 Low    |

The Explain drawer shows the full formula with your numbers substituted in.
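
For readers who want to reproduce the badge outside the product, here is a minimal Python sketch of a 95% Wilson score interval and the threshold mapping from the table. It assumes "margin of error" means half the interval width. The function names and that reading are assumptions, and the Explain drawer remains the authoritative source for the exact formula.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def wilson_interval(passes: int, total: int, z: float = Z_95) -> tuple:
    """Return (low, high) bounds of the Wilson score interval as fractions."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def confidence_level(passes: int, total: int) -> str:
    """Map the margin of error (in percentage points) to a badge level."""
    low, high = wilson_interval(passes, total)
    margin_pp = (high - low) / 2 * 100  # assumption: margin = half-width
    if margin_pp <= 5:
        return "High"
    if margin_pp <= 15:
        return "Medium"
    return "Low"


print(confidence_level(400, 500))  # High
print(confidence_level(40, 50))    # Medium
print(confidence_level(5, 5))      # Low
```

The last call shows why Wilson is used: five passes out of five still gives a wide interval, so a 100% pass rate from a handful of runs lands at Low rather than at perfect confidence.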

### Stability

Stability is the **spread of individual scores around the average**, measured by sample standard deviation on a 0–1 scale.

| Spread           | Level     |
| ---------------- | --------- |
| ≤ 0.10           | 🟢 High   |
| > 0.10 to ≤ 0.25 | 🟡 Medium |
| > 0.25           | 🔴 Low    |

Most evaluation types already produce a 0–1 score. **Naturalness** and **Style** evaluations produce 1–5 — these are converted to 0–1 before measuring spread, so every evaluation contributes on the same scale.
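
The calculation can be sketched in a few lines of Python. The conversion from 1–5 to 0–1 is assumed here to be a linear rescale, `(score - 1) / 4`, because this page does not state the exact conversion. The function names are likewise illustrative.

```python
import statistics
from typing import Optional, Sequence


def to_unit_scale(score: float, scale_min: float = 1.0, scale_max: float = 5.0) -> float:
    """Linearly rescale a score to 0-1 (assumed conversion, not confirmed)."""
    return (score - scale_min) / (scale_max - scale_min)


def stability_level(scores: Sequence[float]) -> Optional[str]:
    """Badge level from the sample standard deviation of 0-1 scores.

    Returns None when there are fewer than two scores, matching the
    case where no badge is shown.
    """
    if len(scores) < 2:
        return None
    spread = statistics.stdev(scores)  # sample standard deviation (n - 1)
    if spread <= 0.10:
        return "High"
    if spread <= 0.25:
        return "Medium"
    return "Low"


print(stability_level([0.90, 0.92, 0.88, 0.91]))              # High
print(stability_level([to_unit_scale(s) for s in (5, 4, 3, 4)]))  # Medium
print(stability_level([1.0, 0.0, 1.0, 0.0]))                  # Low
```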

### Health Score

```
healthScore = 0.6 × passRate + 0.2 × confidenceScore + 0.2 × stabilityScore
```

All three inputs are on a 0–100 scale. `passRate` is the percentage (e.g. 94 for 94%, not 0.94). `confidenceScore` and `stabilityScore` are mapped from the level: High = 100, Medium = 60, Low = 0.

For a task at 94% pass rate, High confidence, and High stability:

```
healthScore = 0.6 × 94 + 0.2 × 100 + 0.2 × 100 = 96.4
```

The Health Score is suppressed (shown as `—`) when Confidence is Low.
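
Put together, the formula and the suppression rule might look like the Python sketch below. The function names are illustrative, and treating fractional scores between 89 and 90 as Medium is an assumption, since the level table only lists whole-number ranges.

```python
from typing import Optional

# Level-to-score mapping as stated above.
LEVEL_SCORE = {"High": 100, "Medium": 60, "Low": 0}


def health_score(pass_rate_pct: float, confidence: str, stability: str) -> Optional[float]:
    """Blend the three inputs on a 0-100 scale; None when suppressed."""
    if confidence == "Low":
        return None  # no score while the pass rate is not trustworthy
    return (0.6 * pass_rate_pct
            + 0.2 * LEVEL_SCORE[confidence]
            + 0.2 * LEVEL_SCORE[stability])


def health_level(score: float) -> str:
    """Map a 0-100 score to a level (boundary handling is an assumption)."""
    if score >= 90:
        return "High"
    if score >= 60:
        return "Medium"
    return "Low"


score = health_score(94, "High", "High")
print(round(score, 1), health_level(score))  # 96.4 High
print(health_score(100, "Low", "High"))      # None
```

Note that a task at 85% pass rate with High confidence and Low stability scores 0.6 × 85 + 20 + 0 = 71 under this formula, which falls in the Medium band.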

### Time window

All four metrics use the **last 30 days** by default. The window is fixed inside a task drawer; the Insights page lets you change it.

***

## Need help?

* **Email** — [support@avidoai.com](mailto:support@avidoai.com)

For the evaluation types that feed these metrics, see [Evaluations](/evaluations).
