Why four numbers, not one?
A pass rate on its own can mislead you.
- 80% from 5 tests could really be anywhere between 30% and 99%. You don’t have enough data to tell.
- 80% from 500 tests is a real 80%.
- 80% with scores tightly clustered is predictable.
- 80% with scores all over the place will surprise you in production.
The pass rate doesn’t change in any of these — but your willingness to ship should. Avido gives you four metrics so you can see the full story, not just the headline.
| Metric | What it tells you |
|---|---|
| Pass Rate | How often the task passes. |
| Evaluations | How many times we’ve tested it. |
| Confidence | Whether you can trust the pass rate. |
| Stability | Whether individual runs are consistent. |
The verdict
You don’t have to read four numbers. Avido turns them into a one-line verdict at the top of every task.
| Confidence | Stability | Verdict |
|---|---|---|
| 🟢 High | 🟢 High | Trustworthy and consistent — ready to ship. |
| 🟢 High | 🔴 Low | The average is reliable, but individual runs vary. Expect surprises in production. |
| 🔴 Low | 🟢 High | Looks consistent so far, but not enough data to be confident yet. Run more evaluations. |
| 🔴 Low | 🔴 Low | Not enough signal. Run more evaluations and investigate the variance. |
When either side is 🟡 Medium, the verdict says so — “Confidence is moderate” or “Stability is moderate” — so you know exactly which number to focus on.
What the badges mean
Pass Rate
The percentage of evaluations that passed. This is the headline number, but it is only as trustworthy as the Confidence badge next to it.
Evaluations
How many tests have run. Avido looks at the last 30 days by default. More evaluations means more confidence in the pass rate.
Confidence
Can I trust this pass rate?
| Badge | Meaning |
|---|---|
| 🟢 High | Plenty of data. The pass rate is reliable. |
| 🟡 Medium | Enough data to be useful, but the true pass rate could still drift from what you see. |
| 🔴 Low | Too few evaluations to draw a real conclusion yet. Run more. |
When Confidence is Medium or Low, Avido tells you how many more evaluations you need to reach High — “Run ~120 more evaluations to reach High confidence.”
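This page doesn't say how the "run ~N more" estimate is produced. As a rough sketch only, one plausible approach is to hold the current pass rate fixed and find the smallest evaluation count at which the Wilson margin of error (described under How the maths works) drops to ±5 percentage points. The function names and the fixed-pass-rate assumption are hypothetical, not Avido's implementation.

```python
import math


def wilson_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval, as a fraction (0.05 = ±5 points).

    z = 1.96 is the usual two-sided 95% value; the exact constant Avido uses
    is not stated on this page, so treat it as an assumption.
    """
    denom = 1 + z * z / n
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n + z * z / (4 * n * n)) / denom


def extra_evaluations_for_high(pass_rate: float, evaluations: int,
                               target: float = 0.05) -> int:
    """Hypothetical estimate: how many more runs until the margin is <= target,
    assuming the pass rate stays where it is today."""
    n = max(evaluations, 1)
    while wilson_margin(pass_rate, n) > target:
        n += 1
    return n - evaluations


if __name__ == "__main__":
    # A task at 80% after 5 runs needs many more; after 500 runs it needs none.
    print(extra_evaluations_for_high(0.80, 5))
    print(extra_evaluations_for_high(0.80, 500))
```

Because the real pass rate will move as new results arrive, any figure like this is an estimate, which fits the "~" in the message.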
Stability
Are the individual runs consistent?
| Badge | Meaning |
|---|---|
| 🟢 High | Scores cluster tightly. Behaviour in production should match the average. |
| 🟡 Medium | Some variation between runs. Worth a look. |
| 🔴 Low | Scores swing widely. The average looks fine, but individual users will get very different experiences. |
Stability needs at least two evaluations to compute. With fewer, a dash placeholder appears instead of a badge.
Why both matter
A high pass rate is necessary, but not enough. Confidence tells you whether the number is real. Stability tells you whether the average matches the experience.
A task at 85% pass rate with High Confidence and Low Stability means the average customer interaction is fine — but a meaningful share of them won’t be. That’s a different decision from “this is great, ship it.”
Where you see metrics
On the task list
Every task shows Pass Rate, Confidence, Stability, and a combined Health Score as columns. Filter by any of them — for example, every task at Low Confidence — to find the ones that need more testing.
Inside a task
Open any task and you’ll see metrics at the top of the Synthetic and Monitoring tabs.
- Synthetic — results from your scheduled and manual tests (controlled test data).
- Monitoring — results from real production traffic.
Each tab has its own verdict and its own numbers. Avido doesn’t combine them, because test data and live traffic are different populations — mixing them would give you a misleading average.
Below the verdict, a per-evaluation breakdown shows the same metrics for each evaluation on the task. That’s where you spot which one is dragging the headline down — accuracy might be excellent while tone is volatile, or vice versa.
The Explain drawer
Click any Confidence or Stability badge and an Explain drawer slides out:
- A plain-English answer at the top.
- The inputs — exactly what data the badge was computed from.
- Why this badge — the threshold and where this task sits in it.
- What would change this — how many more evaluations are needed, or where to investigate volatility.
- Show math — for the curious, the formulas with your numbers filled in.
The drawer is for understanding, not configuration.
On the Insights page
Three headline cards roll up across your whole application:
- “X% of tasks have high confidence” — how much of your suite you can trust today.
- “N tasks need more testing” — count of tasks at Low or Medium confidence.
- “N tasks have unstable scores” — count of tasks at Low stability.
Each card links straight through to a filtered task list.
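If it helps to see the counting rules spelled out, here is a minimal sketch of how the three cards could be derived from a list of tasks. `TaskSummary` and `insights_cards` are hypothetical names for illustration. This is not an Avido API, and how the percentage is rounded in the product is not stated here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskSummary:
    """Hypothetical per-task record holding the two badge levels."""
    name: str
    confidence: str  # "High", "Medium" or "Low"
    stability: str   # "High", "Medium" or "Low"


def insights_cards(tasks: List[TaskSummary]) -> Dict[str, float]:
    total = len(tasks)
    high_confidence = sum(t.confidence == "High" for t in tasks)
    return {
        # "X% of tasks have high confidence"
        "pct_high_confidence": 100 * high_confidence / total if total else 0.0,
        # "N tasks need more testing": Low or Medium confidence
        "need_more_testing": sum(t.confidence in ("Low", "Medium") for t in tasks),
        # "N tasks have unstable scores": Low stability only
        "unstable": sum(t.stability == "Low" for t in tasks),
    }


if __name__ == "__main__":
    suite = [
        TaskSummary("refunds", "High", "High"),
        TaskSummary("onboarding", "Medium", "Low"),
        TaskSummary("billing", "Low", "High"),
        TaskSummary("escalation", "High", "Low"),
    ]
    print(insights_cards(suite))
```

Note that a Medium-stability task counts toward neither the "need more testing" nor the "unstable" card under these rules.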
Health Score
When you just want one number. Health Score blends Pass Rate, Confidence, and Stability into a single 0–100 score:
| Score | Level |
|---|---|
| 90–100 | 🟢 High |
| 60–89 | 🟡 Medium |
| 0–59 | 🔴 Low |
Pass Rate carries the most weight (60%). Confidence and Stability split the rest, 20% each.
If Confidence is Low, the Health Score shows as —. There’s no point rolling up a pass rate that isn’t trustworthy yet — run more evaluations and the score will reappear.
Common questions
What if there’s no data yet?
The verdict and metrics stay hidden until evaluations exist. Add a task and run some tests.
What if the pass rate is 100%?
You can still have Low Confidence if it’s based on only a handful of runs. The verdict will tell you to run more before celebrating.
Why are Synthetic and Monitoring shown separately?
Test data and real production traffic look different by design. Pooling them produces a number that doesn’t mean much. Switch tabs to see each side cleanly.
Why aren’t experiments in here?
Experiments are controlled comparisons, not your baseline quality measurement. They live in their own area so they don’t skew the headline numbers.
Can I change the thresholds?
Not in this version. They’re tuned to work out of the box for most teams.
How the maths works
You don’t need to read this section. Avido has done the maths. It’s here for the curious — and for compliance reviewers who want to verify the approach.
Confidence
Confidence is the margin of error around the pass rate, computed as a 95% Wilson score interval. Wilson stays well-behaved at small sample sizes and at extreme pass rates (close to 0% or 100%), where the simpler normal-approximation (Wald) formula breaks down: at exactly 0% or 100% it produces a zero-width interval and so falsely reports perfect confidence.
The badge comes from the size of the error bar:
| Margin of error | Level |
|---|---|
| ≤ ±5 percentage points | 🟢 High |
| > ±5 and ≤ ±15 percentage points | 🟡 Medium |
| > ±15 percentage points | 🔴 Low |
The Explain drawer shows the full formula with your numbers substituted in.
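For readers who want to reproduce the badge by hand, here is a minimal sketch of the standard Wilson score interval and the threshold mapping from the table above. It assumes the "margin of error" is the half-width of the interval and that z = 1.96 is used for 95%. Neither detail is confirmed on this page, and the function names are illustrative.

```python
import math
from typing import Tuple

Z_95 = 1.96  # assumed two-sided 95% normal quantile


def wilson_interval(passes: int, total: int, z: float = Z_95) -> Tuple[float, float]:
    """Standard Wilson score interval for a pass rate, as fractions in [0, 1]."""
    if total <= 0:
        raise ValueError("need at least one evaluation")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def confidence_level(passes: int, total: int) -> str:
    """Map the interval half-width (in percentage points) to a badge level."""
    low, high = wilson_interval(passes, total)
    margin_pp = (high - low) / 2 * 100
    if margin_pp <= 5:
        return "High"
    if margin_pp <= 15:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    for passes, total in [(5, 5), (47, 50), (470, 500)]:
        print(passes, total, confidence_level(passes, total))
```

Under these assumptions, 5 passes out of 5 lands on Low, 47 out of 50 on Medium, and 470 out of 500 on High. A perfect pass rate from a handful of runs still earns a Low badge.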
Stability
Stability is the spread of individual scores around the average, measured by sample standard deviation on a 0–1 scale.
| Spread | Level |
|---|---|
| ≤ 0.10 | 🟢 High |
| > 0.10 and ≤ 0.25 | 🟡 Medium |
| > 0.25 | 🔴 Low |
Most evaluation types already produce a 0–1 score. Naturalness and Style evaluations produce 1–5 — these are converted to 0–1 before measuring spread, so every evaluation contributes on the same scale.
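A small sketch of the calculation, for the curious. It uses the sample standard deviation from Python's standard library and the thresholds in the table above. The linear rescaling of 1–5 scores, `(score - 1) / 4`, is an assumption, since the page only says the scores are converted to 0–1. The function names are illustrative.

```python
import statistics
from typing import List, Optional


def to_unit_scale(score: float, lowest: float = 1.0, highest: float = 5.0) -> float:
    """Assumed linear conversion of a 1-5 score to the 0-1 scale."""
    return (score - lowest) / (highest - lowest)


def stability_level(scores: List[float]) -> Optional[str]:
    """Badge level from the spread of 0-1 scores; None when there are fewer than two."""
    if len(scores) < 2:
        return None  # not enough data for a sample standard deviation
    spread = statistics.stdev(scores)
    if spread <= 0.10:
        return "High"
    if spread <= 0.25:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    print(stability_level([0.90, 0.92, 0.88, 0.91]))      # tightly clustered
    print(stability_level([0.6, 0.8, 0.9, 0.7, 1.0]))     # some variation
    print(stability_level([1.0, 0.0, 1.0, 1.0, 0.0]))     # swings widely
    style_scores = [to_unit_scale(s) for s in [5, 4, 5, 3]]
    print(stability_level(style_scores))
```

With these inputs, the first three lists come out as High, Medium, and Low respectively.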
Health Score
healthScore = 0.6 × passRate + 0.2 × confidenceScore + 0.2 × stabilityScore
All three inputs are on a 0–100 scale. passRate is the percentage (e.g. 94 for 94%, not 0.94). confidenceScore and stabilityScore are mapped from the level: High = 100, Medium = 60, Low = 0.
For a task at 94% pass rate, High confidence, and High stability:
healthScore = 0.6 × 94 + 0.2 × 100 + 0.2 × 100 = 96.4
The Health Score is suppressed when Confidence is Low.
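The same formula as a short sketch you can run with your own numbers. The weights, the level-to-score mapping, and the suppression rule come from this page. The function names are illustrative, and treating any score of 90 or above as High (including fractional scores between 89 and 90 as Medium) is an assumption about how the bands are applied.

```python
from typing import Optional

LEVEL_SCORE = {"High": 100, "Medium": 60, "Low": 0}


def health_score(pass_rate_pct: float, confidence: str, stability: str) -> Optional[float]:
    """Blend the three inputs on a 0-100 scale; None when Confidence is Low."""
    if confidence == "Low":
        return None  # suppressed: the pass rate isn't trustworthy yet
    return (0.6 * pass_rate_pct
            + 0.2 * LEVEL_SCORE[confidence]
            + 0.2 * LEVEL_SCORE[stability])


def health_level(score: Optional[float]) -> Optional[str]:
    """Assumed banding: >= 90 High, >= 60 Medium, otherwise Low."""
    if score is None:
        return None
    if score >= 90:
        return "High"
    if score >= 60:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    score = health_score(94, "High", "High")
    print(round(score, 1), health_level(score))
    print(health_score(100, "Low", "High"))
```

The first call reproduces the worked example above (96.4). The second shows that even a 100% pass rate yields no Health Score while Confidence is Low.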
Time window
All four metrics use the last 30 days by default. The window is fixed inside a task drawer; the Insights page lets you change it.
Need help?
For the evaluation types that feed these metrics, see Evaluations.