Why four numbers, not one?

A pass rate on its own can mislead you.
  • 80% from 5 tests could really be anywhere between 30% and 99%. You don’t have enough data to tell.
  • 80% from 500 tests is a real 80%.
  • 80% with scores tightly clustered is predictable.
  • 80% with scores all over the place will surprise you in production.
The pass rate doesn’t change in any of these — but your willingness to ship should. Avido gives you four metrics so you can see the full story, not just the headline.

| Metric | What it tells you |
|---|---|
| Pass Rate | How often the task passes. |
| Evaluations | How many times we’ve tested it. |
| Confidence | Whether you can trust the pass rate. |
| Stability | Whether individual runs are consistent. |

The verdict

You don’t have to read four numbers. Avido turns them into a one-line verdict at the top of every task.

| Confidence | Stability | Verdict |
|---|---|---|
| 🟢 High | 🟢 High | Trustworthy and consistent; ready to ship. |
| 🟢 High | 🔴 Low | The average is reliable, but individual runs vary. Expect surprises in production. |
| 🔴 Low | 🟢 High | Looks consistent so far, but not enough data to be confident yet. Run more evaluations. |
| 🔴 Low | 🔴 Low | Not enough signal. Run more evaluations and investigate the variance. |

When either side is 🟡 Medium, the verdict says so — “Confidence is moderate” or “Stability is moderate” — so you know exactly which number to focus on.
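
The verdict table above amounts to a lookup on the two badge levels. Here is a minimal sketch of that lookup. The names `VERDICTS` and `verdict` are hypothetical, and the handling of Medium is an assumption: the page only quotes the two "moderate" phrases, so the sketch returns those phrases and nothing more.

```python
# Verdict text for the four High/Low combinations, taken from the table above.
VERDICTS = {
    ("High", "High"): "Trustworthy and consistent; ready to ship.",
    ("High", "Low"): "The average is reliable, but individual runs vary. "
                     "Expect surprises in production.",
    ("Low", "High"): "Looks consistent so far, but not enough data to be "
                     "confident yet. Run more evaluations.",
    ("Low", "Low"): "Not enough signal. Run more evaluations and "
                    "investigate the variance.",
}


def verdict(confidence: str, stability: str) -> str:
    """Return a one-line verdict for a (confidence, stability) pair."""
    key = (confidence, stability)
    if key in VERDICTS:
        return VERDICTS[key]
    # Assumption: when a side is Medium, only the quoted phrase is reported.
    notes = []
    if confidence == "Medium":
        notes.append("Confidence is moderate")
    if stability == "Medium":
        notes.append("Stability is moderate")
    return "; ".join(notes) + "."


print(verdict("High", "Low"))
print(verdict("Medium", "High"))
```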

What the badges mean

Pass Rate

The percentage of evaluations that passed. The headline number — only useful when Confidence is also High.

Evaluations

How many tests have run. Avido looks at the last 30 days by default. More evaluations means more confidence in the pass rate.

Confidence

Can I trust this pass rate?

| Badge | Meaning |
|---|---|
| 🟢 High | Plenty of data. The pass rate is reliable. |
| 🟡 Medium | Enough data to be useful, but the true pass rate could still drift from what you see. |
| 🔴 Low | Too few evaluations to draw a real conclusion yet. Run more. |

When Confidence is Medium or Low, Avido tells you how many more evaluations you need to reach High — “Run ~120 more evaluations to reach High confidence.”
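
How Avido arrives at the "run ~N more" figure is not spelled out on this page. One plausible way to estimate it, sketched below, is to assume the observed pass rate holds and search for the smallest sample size whose 95% Wilson margin of error is at most 5 percentage points (the High threshold described under "How the maths works"). The function names, and treating the margin as the half-width of the interval, are assumptions made for illustration.

```python
import math


def wilson_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for pass rate p over n runs."""
    denom = 1 + z**2 / n
    return z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom


def evaluations_to_high(passes: int, total: int, target: float = 0.05) -> int:
    """Extra evaluations needed to reach the target margin.

    Assumes the pass rate stays at its currently observed value.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    n = total
    # The half-width shrinks as n grows, so this loop always terminates.
    while wilson_half_width(p, n) > target:
        n += 1
    return n - total


print(evaluations_to_high(4, 5))      # few runs: many more needed
print(evaluations_to_high(400, 500))  # already within the High threshold
```

The real number will shift as new results change the pass rate, which is consistent with the estimate being shown with a "~".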

Stability

Are the individual runs consistent?

| Badge | Meaning |
|---|---|
| 🟢 High | Scores cluster tightly. Behaviour in production should match the average. |
| 🟡 Medium | Some variation between runs. Worth a look. |
| 🔴 Low | Scores swing widely. The average looks fine, but individual users will get very different experiences. |

Stability needs at least two evaluations to compute. With fewer, you’ll see a placeholder instead of a badge.

Why both matter

A high pass rate is necessary, but not enough. Confidence tells you whether the number is real. Stability tells you whether the average matches the experience. A task at 85% pass rate with High Confidence and Low Stability means the average customer interaction is fine — but a meaningful share of them won’t be. That’s a different decision from “this is great, ship it.”

Where you see metrics

On the task list

Every task shows Pass Rate, Confidence, Stability, and a combined Health Score as columns. Filter by any of them — for example, every task at Low Confidence — to find the ones that need more testing.

Inside a task

Open any task and you’ll see metrics at the top of the Synthetic and Monitoring tabs.
  • Synthetic — results from your scheduled and manual tests (controlled test data).
  • Monitoring — results from real production traffic.
Each tab has its own verdict and its own numbers. Avido doesn’t combine them, because test data and live traffic are different populations — mixing them would give you a misleading average. Below the verdict, a per-evaluation breakdown shows the same metrics for each evaluation on the task. That’s where you spot which one is dragging the headline down — accuracy might be excellent while tone is volatile, or vice versa.

The Explain drawer

Click any Confidence or Stability badge and an Explain drawer slides out:
  1. A plain-English answer at the top.
  2. The inputs — exactly what data the badge was computed from.
  3. Why this badge — the threshold and where this task sits in it.
  4. What would change this — how many more evaluations are needed, or where to investigate volatility.
  5. Show math — for the curious, the formulas with your numbers filled in.
The drawer is for understanding, not configuration.

On the Insights page

Three headline cards roll up across your whole application:
  • “X% of tasks have high confidence” — how much of your suite you can trust today.
  • “N tasks need more testing” — count of tasks at Low or Medium confidence.
  • “N tasks have unstable scores” — count of tasks at Low stability.
Each card links straight through to a filtered task list.

Health Score

When you just want one number. Health Score blends Pass Rate, Confidence, and Stability into a single 0–100 score:

| Score | Level |
|---|---|
| 90–100 | 🟢 High |
| 60–89 | 🟡 Medium |
| 0–59 | 🔴 Low |

Pass Rate carries the most weight (60%). Confidence and Stability split the rest, 20% each.
If Confidence is Low, the Health Score is hidden. There’s no point rolling up a pass rate that isn’t trustworthy yet. Run more evaluations and the score will reappear.

Common questions

What if there’s no data yet?
The verdict and metrics stay hidden until evaluations exist. Add a task and run some tests.

What if the pass rate is 100%?
You can still have Low Confidence if it’s based on only a handful of runs. The verdict will tell you to run more before celebrating.

Why are Synthetic and Monitoring shown separately?
Test data and real production traffic look different by design. Pooling them produces a number that doesn’t mean much. Switch tabs to see each side cleanly.

Why aren’t experiments in here?
Experiments are controlled comparisons, not your baseline quality measurement. They live in their own area so they don’t skew the headline numbers.

Can I change the thresholds?
Not in this version. They’re tuned to work out of the box for most teams.

How the maths works

You don’t need to read this section. Avido has done the maths. It’s here for the curious — and for compliance reviewers who want to verify the approach.

Confidence

Confidence is the margin of error around the pass rate, computed as a 95% Wilson score interval. Wilson stays well-behaved at small sample sizes and at extreme pass rates (close to 0% or 100%), where the simpler textbook formula falsely reports perfect confidence. The badge comes from the size of the error bar:

| Margin of error | Level |
|---|---|
| ≤ ±5 percentage points | 🟢 High |
| ±5 – ±15 percentage points | 🟡 Medium |
| > ±15 percentage points | 🔴 Low |

The Explain drawer shows the full formula with your numbers substituted in.
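
For readers who want to reproduce the badge outside the drawer, here is a minimal sketch of the standard Wilson score interval and the threshold mapping from the table above. It is not Avido's implementation: the function names are hypothetical, and reading "margin of error" as half the interval width is an assumption.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return (lower, upper) Wilson bounds as proportions; z = 1.96 gives 95%."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def confidence_level(passes: int, total: int) -> str:
    """Map the margin of error (in percentage points) to a badge level."""
    lower, upper = wilson_interval(passes, total)
    margin_pp = (upper - lower) / 2 * 100  # assumption: margin = half-width
    if margin_pp <= 5:
        return "High"
    if margin_pp <= 15:
        return "Medium"
    return "Low"


print(confidence_level(4, 5))      # 80% from 5 runs
print(confidence_level(400, 500))  # 80% from 500 runs
print(confidence_level(5, 5))      # 100% from 5 runs
```

The last call shows the behaviour the text describes at extreme pass rates: a perfect score from five runs still yields a wide interval, where the simpler textbook formula would report a margin of zero.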

Stability

Stability is the spread of individual scores around the average, measured by sample standard deviation on a 0–1 scale.

| Spread | Level |
|---|---|
| ≤ 0.10 | 🟢 High |
| 0.10 – 0.25 | 🟡 Medium |
| > 0.25 | 🔴 Low |

Most evaluation types already produce a 0–1 score. Naturalness and Style evaluations produce 1–5 — these are converted to 0–1 before measuring spread, so every evaluation contributes on the same scale.
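
The calculation can be sketched in a few lines. The function names are hypothetical, and the linear rescaling of 1–5 scores (so that 1 maps to 0 and 5 maps to 1) is an assumption, since this page does not state the exact conversion.

```python
import statistics
from typing import Optional


def normalise(score: float, scale_min: float = 0.0, scale_max: float = 1.0) -> float:
    """Linearly rescale a score to 0-1 (assumed conversion for 1-5 scales)."""
    return (score - scale_min) / (scale_max - scale_min)


def stability_level(scores: list[float]) -> Optional[str]:
    """Badge level from the sample standard deviation of 0-1 scores.

    Returns None with fewer than two scores, where no spread can be computed.
    """
    if len(scores) < 2:
        return None
    spread = statistics.stdev(scores)  # sample standard deviation (n - 1)
    if spread <= 0.10:
        return "High"
    if spread <= 0.25:
        return "Medium"
    return "Low"


# A 1-5 evaluation is rescaled before its spread is measured.
style_scores = [normalise(s, 1, 5) for s in [4, 5, 4, 5]]
print(stability_level(style_scores))
print(stability_level([1.0, 0.0, 1.0, 0.0]))
```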

Health Score

healthScore = 0.6 × passRate + 0.2 × confidenceScore + 0.2 × stabilityScore
All three inputs are on a 0–100 scale. passRate is the percentage (e.g. 94 for 94%, not 0.94). confidenceScore and stabilityScore are mapped from the level: High = 100, Medium = 60, Low = 0. For a task at 94% pass rate, High confidence, and High stability:
healthScore = 0.6 × 94 + 0.2 × 100 + 0.2 × 100 = 96.4
The Health Score is suppressed when Confidence is Low.
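
The formula and the suppression rule fit in a short function. This is a sketch under stated assumptions: the names are hypothetical, and because the published bands (90–100, 60–89, 0–59) do not say how fractional scores between 89 and 90 are treated, the sketch treats anything below 90 as Medium.

```python
from typing import Optional

# Level-to-score mapping given in the text above.
LEVEL_SCORES = {"High": 100, "Medium": 60, "Low": 0}


def health_score(pass_rate: float, confidence: str, stability: str) -> Optional[float]:
    """Blend the three inputs; pass_rate is a percentage such as 94, not 0.94."""
    if confidence == "Low":
        return None  # suppressed: the pass rate is not trustworthy yet
    return (
        0.6 * pass_rate
        + 0.2 * LEVEL_SCORES[confidence]
        + 0.2 * LEVEL_SCORES[stability]
    )


def health_level(score: float) -> str:
    """Map a 0-100 score to a level (boundary handling is an assumption)."""
    if score >= 90:
        return "High"
    if score >= 60:
        return "Medium"
    return "Low"


score = health_score(94, "High", "High")
print(round(score, 1), health_level(score))
print(health_score(100, "Low", "High"))
```

Note that a task at 85% pass rate with High Confidence and Low Stability comes out at 0.6 × 85 + 20 + 0 = 71, which lands in Medium even though the pass rate alone looks healthy.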

Time window

All four metrics use the last 30 days by default. The window is fixed inside a task drawer; the Insights page lets you change it.

Need help?

For the evaluation types that feed these metrics, see Evaluations.