Overview

Avido provides five evaluation types to measure different aspects of AI quality. Each evaluation is applied to tasks, then runs automatically when those tasks execute, providing actionable insights when quality standards aren’t met.
| Evaluation Type | Purpose | Score Range | Pass Threshold |
|---|---|---|---|
| Naturalness | Human-like communication quality | 1-5 | 3.5 |
| Style | Brand guideline compliance | 1-5 | 3.5 |
| Recall | RAG pipeline performance | 0-1 | 0.5 |
| Fact Checker | Factual accuracy vs ground truth | 0-1 | 0.8 |
| Custom | Domain-specific criteria | 0-1 | 0.5 |

Naturalness

Measures how natural, engaging, and clear your AI’s responses are to users.

What It Evaluates

The Naturalness evaluation assesses five dimensions of response quality:
  • Coherence – Logical flow and consistency of ideas
  • Engagingness – Ability to capture and maintain user interest
  • Naturalness – Human-like language and tone
  • Relevance – On-topic responses that address the user’s intent
  • Clarity – Clear, understandable language without ambiguity

How It Works

An LLM evaluates your AI’s response across all five dimensions on a 1-5 scale. The overall score is the average of these dimensions.
Pass Criteria:
  • All five dimensions must score ≥ 3.5
  • Average score ≥ 3.5
This ensures that a high overall average cannot mask a single failing dimension.
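
The pass logic can be sketched in a few lines of Python. This is an illustrative check, not part of Avido's SDK; the dimension names and the 3.5 threshold come from this page, and the function name is hypothetical.

```python
NATURALNESS_THRESHOLD = 3.5

def naturalness_passes(scores: dict[str, float]) -> tuple[bool, float]:
    """Pass criteria from above: every dimension AND the average must be >= 3.5.
    `scores` maps each dimension name to its 1-5 rating."""
    average = sum(scores.values()) / len(scores)
    every_dimension_ok = all(value >= NATURALNESS_THRESHOLD for value in scores.values())
    return every_dimension_ok and average >= NATURALNESS_THRESHOLD, average

# Matches the third row of the table below: a single low Clarity score fails
# the evaluation even though the average (4.4) clears the threshold.
passed, average = naturalness_passes(
    {"coherence": 5, "engagingness": 5, "naturalness": 5, "relevance": 5, "clarity": 2}
)
print(passed, average)  # False 4.4
```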

Example Results

| Coherence | Engagingness | Naturalness | Relevance | Clarity | Overall | Result |
|---|---|---|---|---|---|---|
| 5 | 5 | 5 | 5 | 5 | 5.0 | ✅ Pass |
| 4 | 4 | 4 | 4 | 4 | 4.0 | ✅ Pass |
| 5 | 5 | 5 | 5 | 2 | 4.4 | ❌ Fail (Clarity < 3.5) |

When to Use

  • Conversational AI and chatbots
  • Customer support automation
  • Content generation systems
  • Any user-facing AI interactions

Style

Evaluates whether responses adhere to your organization’s style guidelines and brand voice.

What It Evaluates

A single comprehensive score (1-5) based on your custom style guide, measuring:
  • Tone and voice consistency
  • Terminology usage
  • Format and structure requirements
  • Brand-specific guidelines
  • Reading level and complexity

How It Works

You provide a style guide document that defines your brand’s communication standards. An LLM evaluates each response against this guide and provides:
  • A score from 1-5
  • Detailed analysis explaining the rating
Pass Criteria:
  • Score ≥ 3.5
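
As a rough sketch of how such a judge could be assembled (the prompt wording, function names, and `call_llm` placeholder below are illustrative assumptions, not Avido's implementation):

```python
STYLE_THRESHOLD = 3.5

def build_style_judge_prompt(style_guide: str, response: str) -> str:
    """Combine your style guide and the AI response into a single judge prompt."""
    return (
        "You are a strict style reviewer. Rate the response against the style guide "
        "on a 1-5 scale and explain your rating.\n\n"
        f"STYLE GUIDE:\n{style_guide}\n\n"
        f"RESPONSE:\n{response}\n\n"
        'Reply as JSON: {"score": <1-5>, "analysis": "<explanation>"}'
    )

def style_passes(score: float) -> bool:
    """Pass criteria from this page: score >= 3.5."""
    return score >= STYLE_THRESHOLD

# judgement = call_llm(build_style_judge_prompt(guide, response))  # hypothetical LLM call
```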

Example Style Guide Elements

# Customer Support Style Guide

**Tone:** Professional yet friendly, never casual
**Voice:** Active voice preferred, clear and direct
**Terminology:** Use "account" not "profile", "transfer" not "send"
**Format:** Start with acknowledgment, provide solution, end with offer to help
**Constraints:** Keep responses under 100 words when possible

When to Use

  • Brand-critical communications
  • Multi-channel consistency (chat, email, voice)
  • Customer-facing applications where brand matters
Note: For regulated industries with strict compliance requirements, use Custom evaluations instead.

Recall (RAG Evaluation)

Comprehensive evaluation of Retrieval-Augmented Generation (RAG) pipeline quality.

What It Evaluates

Four metrics that measure different aspects of RAG performance:
  • Context Relevancy – Are retrieved documents relevant to the query?
  • Context Precision – How well-ranked are the retrieved documents?
  • Faithfulness – Is the answer grounded in the retrieved context?
  • Answer Relevancy – Does the answer address the user’s question?

How It Works

Each metric produces a score from 0-1 (higher is better). The overall score is the average of Context Precision, Faithfulness, and Answer Relevancy.
Pass Criteria:
  • Context Precision ≥ 0.5
  • Faithfulness ≥ 0.5
  • Answer Relevancy ≥ 0.5
Note: Context Relevancy is computed for observability but doesn’t affect pass/fail status.
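
A minimal sketch of this scoring rule, using the metric names and 0.5 threshold from this page (the helper itself is illustrative, not Avido's implementation):

```python
RECALL_THRESHOLD = 0.5

def recall_result(context_relevancy: float, context_precision: float,
                  faithfulness: float, answer_relevancy: float) -> tuple[float, bool]:
    """Overall score averages Context Precision, Faithfulness, and Answer Relevancy;
    each of those three must also clear 0.5 to pass. Context Relevancy is reported
    for observability only and does not affect the outcome."""
    gating_metrics = [context_precision, faithfulness, answer_relevancy]
    overall = sum(gating_metrics) / len(gating_metrics)
    passed = all(metric >= RECALL_THRESHOLD for metric in gating_metrics)
    return overall, passed

overall, passed = recall_result(0.9, 0.7, 0.4, 0.8)
print(round(overall, 2), passed)  # 0.63 False -- low faithfulness fails the run
```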

Score Interpretation

| Score Range | Interpretation | Action Required |
|---|---|---|
| 0.8 - 1.0 | Excellent performance | Monitor |
| 0.5 - 0.8 | Acceptable quality | Optimize if critical |
| 0.0 - 0.5 | Poor performance | Investigate immediately |

Common Issues and Solutions

| Low Metric | Likely Cause | Solution |
|---|---|---|
| Context Precision | Too many irrelevant chunks retrieved | Reduce top_k, improve filters |
| Context Relevancy | Embedding/index drift | Retrain embeddings, update index |
| Faithfulness | Model hallucinating | Add grounding instructions, reduce temperature |
| Answer Relevancy | Answer drifts off-topic | Improve prompt focus, add constraints |

When to Use

  • Knowledge base search and retrieval
  • Document Q&A systems
  • RAG pipelines
  • Any system combining retrieval with generation

Fact Checker

Validates factual accuracy of AI responses against ground truth.

What It Evaluates

Compares AI-generated statements with known correct information, classifying each statement as:
  • True Positives (TP) – Correct facts present in the response
  • False Positives (FP) – Incorrect facts in the response
  • False Negatives (FN) – Correct facts omitted from the response

How It Works

An LLM extracts factual statements from both the AI response and ground truth, then classifies them. The F1 score measures accuracy:
F1 = TP / (TP + 0.5 × (FP + FN))
Pass Criteria:
  • F1 score ≥ 0.8
This allows high-quality answers with minor omissions to pass while maintaining strict accuracy standards.
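
The formula and threshold can be reproduced in a few lines (an illustrative sketch, not Avido's implementation):

```python
def fact_checker_f1(tp: int, fp: int, fn: int) -> float:
    """F1 = TP / (TP + 0.5 * (FP + FN)); returns 0.0 when no statements were classified."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0

def fact_checker_passes(tp: int, fp: int, fn: int, threshold: float = 0.8) -> bool:
    return fact_checker_f1(tp, fp, fn) >= threshold

print(round(fact_checker_f1(5, 0, 1), 2))  # 0.91 -- a single omission still passes
print(round(fact_checker_f1(1, 2, 2), 2))  # 0.33 -- the nuclear fusion example below fails
```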

Example Classification

Question: “What powers the sun?”
Ground Truth: “The sun is powered by nuclear fusion. In its core, hydrogen atoms fuse to form helium, releasing tremendous energy.”
AI Response: “The sun is powered by nuclear fission, similar to nuclear reactors, and provides light to the solar system.”
Classification:
  • TP: [“Provides light to the solar system”]
  • FP: [“Powered by nuclear fission”, “Similar to nuclear reactors”]
  • FN: [“Powered by nuclear fusion”, “Hydrogen fuses to form helium”]
  • F1 Score: 1 / (1 + 0.5 × (2 + 2)) ≈ 0.33 → ❌ Fail

Score Examples

| TP | FP | FN | F1 Score | Result | Notes |
|---|---|---|---|---|---|
| 5 | 0 | 0 | 1.0 | ✅ Pass | Perfect accuracy |
| 5 | 0 | 1 | 0.91 | ✅ Pass | Minor omission acceptable |
| 5 | 1 | 0 | 0.91 | ✅ Pass | Minor error acceptable |
| 4 | 1 | 1 | 0.8 | ✅ Pass | Boundary case |
| 3 | 0 | 2 | 0.75 | ❌ Fail | Too many omissions |
| 1 | 4 | 0 | 0.33 | ❌ Fail | Mostly incorrect |

When to Use

  • Financial data and calculations
  • Medical or legal information
  • Product specifications and features
  • Any domain where factual accuracy is critical

Custom

Create domain-specific evaluations for your unique business requirements.

What It Evaluates

Whatever you define in a custom criterion. Common use cases:
  • Regulatory compliance checks
  • Schema or format validation
  • Latency or performance SLAs
  • Business logic requirements
  • Security and privacy rules

How It Works

You provide a criterion describing what to check. An LLM evaluates the response and returns:
  • Binary pass/fail (1 or 0)
  • Reasoning explaining the decision
Pass Criteria:
  • Score = 1 (criterion met)
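
As a hedged illustration, the pass/fail-plus-reasoning result can be modelled as a small structure like the one below; the field names are assumptions for this example, not Avido's API:

```python
from dataclasses import dataclass

@dataclass
class CustomEvalResult:
    """Outcome of one custom criterion: 1 means the criterion was met, 0 means it was not."""
    score: int        # 1 or 0
    reasoning: str    # the LLM's explanation of its decision

    @property
    def passed(self) -> bool:
        return self.score == 1

result = CustomEvalResult(score=0, reasoning="Response includes an account number (PII).")
print(result.passed)  # False
```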

Example Criteria

# Compliance Example
"The response must not mention specific account numbers, 
social security numbers, or other PII. Pass if no PII is present."

# Format Example
"The response must be formatted as a JSON object with 
'action', 'parameters', and 'reasoning' keys. Pass if valid JSON 
with all required keys."

# Business Logic Example
"For loan inquiries, the response must ask for income verification 
before discussing loan amounts. Pass if verification is requested first."

# Chatbot Boundaries Example
"When asked to perform actions outside the chatbot's scope (e.g., 
processing refunds, accessing user accounts, making reservations), 
the response must politely decline and explain limitations. Pass if 
the chatbot appropriately refuses and provides alternative guidance."
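
Criteria like the Format Example above can also be mirrored with a deterministic local check alongside the LLM judgment. A minimal sketch, assuming the same three required keys (the function is illustrative, not part of Avido):

```python
import json

REQUIRED_KEYS = {"action", "parameters", "reasoning"}

def is_valid_structured_output(response_text: str) -> bool:
    """Pass only if the response parses as a JSON object containing all required keys."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_KEYS.issubset(payload)

print(is_valid_structured_output('{"action": "lookup", "parameters": {}, "reasoning": "..."}'))  # True
print(is_valid_structured_output("Sure, I can help with that!"))  # False
```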

When to Use

  • Industry-specific compliance requirements
  • Custom business rules and workflows
  • Structured output validation
  • Security and privacy checks
  • Chatbot safety and boundaries
  • Any evaluation not covered by built-in types

Best Practices

Combining Evaluations

Use multiple evaluation types together for comprehensive quality assurance. The right combination depends on what your specific task does:
  • Knowledge Base Q&A (RAG): Recall + Fact Checker + Naturalness
  • Creative Content Generation: Naturalness + Style + Fact Checker (if accuracy matters)
  • Retrieval-Based Customer Support: Recall + Naturalness + Style + Custom (compliance)
  • Direct Response (no retrieval): Naturalness + Style + Custom (compliance)
  • Chatbot with Boundaries: Naturalness + Custom (safety/boundaries) + Custom (compliance)
  • Structured Output: Custom (format) + Custom (business logic)
Choose evaluations based on your task’s behavior, not just your application type. For example, a customer support application might use different evaluation combinations for retrieval-based responses versus direct answers, and might add Custom evaluations to ensure the chatbot properly refuses out-of-scope requests.

Issue Creation

When an evaluation fails, Avido automatically creates an issue with:
  • Title – Evaluation type and failure summary
  • Priority – HIGH, MEDIUM, or LOW based on severity
  • Description – Scores, reasoning, and context
  • Trace Link – Direct access to the full conversation
All issues appear in your Inbox for triage and resolution.

Need Help?

For API details and integration guides, see the API Reference.