Overview
Avido provides five evaluation types to measure different aspects of AI quality. Each evaluation is applied to tasks and runs automatically when those tasks execute, providing actionable insights when quality standards aren’t met.
| Evaluation Type | Purpose | Score Range | Pass Threshold |
|---|---|---|---|
| Naturalness | Human-like communication quality | 1-5 | 3.5 |
| Style | Brand guideline compliance | 1-5 | 3.5 |
| Recall | RAG pipeline performance | 0-1 | 0.5 |
| Fact Checker | Factual accuracy vs ground truth | 0-1 | 0.8 |
| Custom | Domain-specific criteria | 0-1 | 0.5 |
Naturalness
Measures how natural, engaging, and clear your AI’s responses are to users.
What It Evaluates
The Naturalness evaluation assesses five dimensions of response quality:
- Coherence – Logical flow and consistency of ideas
- Engagingness – Ability to capture and maintain user interest
- Naturalness – Human-like language and tone
- Relevance – On-topic responses that address the user’s intent
- Clarity – Clear, understandable language without ambiguity
How It Works
An LLM evaluates your AI’s response across all five dimensions on a 1-5 scale. The overall score is the average of these dimensions (see the sketch below).
Pass Criteria:
- All five dimensions must score ≥ 3.5
- Average score ≥ 3.5
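As a rough illustration, here is a minimal sketch of that pass logic, assuming the five dimension scores are available as plain numbers (this is not Avido’s implementation):

```python
# Minimal sketch of the Naturalness pass logic described above (illustrative only).
def naturalness_passes(scores: dict[str, float], threshold: float = 3.5) -> bool:
    """Pass only if every dimension and the overall average meet the threshold."""
    overall = sum(scores.values()) / len(scores)
    return all(s >= threshold for s in scores.values()) and overall >= threshold

# Mirrors the third row of the table below: a single low Clarity score fails
# the evaluation even though the 4.4 average clears the threshold.
print(naturalness_passes(
    {"coherence": 5, "engagingness": 5, "naturalness": 5, "relevance": 5, "clarity": 2}
))  # False
```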
Example Results
| Coherence | Engagingness | Naturalness | Relevance | Clarity | Overall | Result |
|---|---|---|---|---|---|---|
| 5 | 5 | 5 | 5 | 5 | 5.0 | ✅ Pass |
| 4 | 4 | 4 | 4 | 4 | 4.0 | ✅ Pass |
| 5 | 5 | 5 | 5 | 2 | 4.4 | ❌ Fail (Clarity < 3.5) |
When to Use
- Conversational AI and chatbots
- Customer support automation
- Content generation systems
- Any user-facing AI interactions
Style
Evaluates whether responses adhere to your organization’s style guidelines and brand voice.
What It Evaluates
A single comprehensive score (1-5) based on your custom style guide, measuring:
- Tone and voice consistency
- Terminology usage
- Format and structure requirements
- Brand-specific guidelines
- Reading level and complexity
How It Works
You provide a style guide document that defines your brand’s communication standards. An LLM evaluates each response against this guide and provides:
- A score from 1-5
- Detailed analysis explaining the rating
Pass Criteria:
- Score ≥ 3.5
Example Style Guide Elements
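The exact contents depend on your brand, but a style guide passed to this evaluation works best when it spells out concrete, checkable rules. A few illustrative elements (examples only, not a required template):
- Voice: “Friendly and professional; avoid jargon, slang, and filler.”
- Terminology: “Say ‘customer’, not ‘user’; use the full product name on first mention.”
- Format: “Keep answers under three short paragraphs; use bullet points for step-by-step instructions.”
- Reading level: “Write in plain language at roughly an 8th-grade reading level.”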
When to Use
- Brand-critical communications
- Multi-channel consistency (chat, email, voice)
- Customer-facing applications where brand matters
Recall (RAG Evaluation)
Comprehensive evaluation of Retrieval-Augmented Generation (RAG) pipeline quality.
What It Evaluates
Four metrics that measure different aspects of RAG performance:
- Context Relevancy – Are retrieved documents relevant to the query?
- Context Precision – How well-ranked are the retrieved documents?
- Faithfulness – Is the answer grounded in the retrieved context?
- Answer Relevancy – Does the answer address the user’s question?
How It Works
Each metric produces a score from 0-1 (higher is better). The overall score is the average of Context Precision, Faithfulness, and Answer Relevancy (see the sketch below).
Pass Criteria:
- Context Precision ≥ 0.5
- Faithfulness ≥ 0.5
- Answer Relevancy ≥ 0.5
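As a rough illustration, here is a minimal sketch of the scoring and gating described above, assuming the three gated metric values have already been computed (not Avido’s implementation):

```python
# Minimal sketch of the Recall (RAG) scoring described above (illustrative only).
def recall_result(context_precision: float, faithfulness: float,
                  answer_relevancy: float, threshold: float = 0.5):
    """Overall score averages the three gated metrics; each must clear the threshold."""
    gated = (context_precision, faithfulness, answer_relevancy)
    overall = sum(gated) / len(gated)
    return overall, all(m >= threshold for m in gated)

print(recall_result(0.9, 0.8, 0.7))  # ≈ (0.8, True)
```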
Score Interpretation
| Score Range | Interpretation | Action Required |
|---|---|---|
| 0.8 - 1.0 | Excellent performance | Monitor |
| 0.5 - 0.8 | Acceptable quality | Optimize if critical |
| 0.0 - 0.5 | Poor performance | Investigate immediately |
Common Issues and Solutions
| Low Metric | Likely Cause | Solution |
|---|---|---|
| Context Precision | Too many irrelevant chunks retrieved | Reduce top_k, improve filters |
| Context Relevancy | Embedding/index drift | Retrain embeddings, update index |
| Faithfulness | Model hallucinating | Add grounding instructions, reduce temperature |
| Answer Relevancy | Answer drifts off-topic | Improve prompt focus, add constraints |
When to Use
- Knowledge base search and retrieval
- Document Q&A systems
- RAG pipelines
- Any system combining retrieval with generation
Fact Checker
Validates the factual accuracy of AI responses against ground truth.
What It Evaluates
Compares AI-generated statements with known correct information, classifying each statement as:
- True Positives (TP) – Correct facts present in the response
- False Positives (FP) – Incorrect facts in the response
- False Negatives (FN) – Correct facts omitted from the response
How It Works
An LLM extracts factual statements from both the AI response and the ground truth, then classifies them. The F1 score measures accuracy (see the sketch below).
Pass Criteria:
- F1 score ≥ 0.8
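For reference, a minimal sketch of an F1 calculation over these statement counts; this assumes the standard precision/recall definition, and the production scorer may weight errors and omissions somewhat differently:

```python
# Standard F1 over extracted statements (assumed definition; illustrative only).
def f1_score(tp: int, fp: int, fn: int) -> float:
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # correct statements / all statements in the response
    recall = tp / (tp + fn)     # correct statements / all ground-truth statements covered
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(tp=5, fp=0, fn=1), 2))  # 0.91, matching a row in the table below
```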
Example Classification
Question: “What powers the sun?”
Ground Truth: “The sun is powered by nuclear fusion. In its core, hydrogen atoms fuse to form helium, releasing tremendous energy.”
AI Response: “The sun is powered by nuclear fission, similar to nuclear reactors, and provides light to the solar system.”
Classification:
- TP: [“Provides light to the solar system”]
- FP: [“Powered by nuclear fission”, “Similar to nuclear reactors”]
- FN: [“Powered by nuclear fusion”, “Hydrogen fuses to form helium”]
- F1 Score: 0.20 → ❌ Fail
Score Examples
| TP | FP | FN | F1 Score | Result | Notes |
|---|---|---|---|---|---|
| 5 | 0 | 0 | 1.0 | ✅ Pass | Perfect accuracy |
| 5 | 0 | 1 | 0.91 | ✅ Pass | Minor omission acceptable |
| 5 | 1 | 0 | 0.91 | ✅ Pass | Minor error acceptable |
| 4 | 1 | 0 | 0.8 | ✅ Pass | Boundary case |
| 3 | 0 | 2 | 0.75 | ❌ Fail | Too many omissions |
| 1 | 4 | 0 | 0.33 | ❌ Fail | Mostly incorrect |
When to Use
- Financial data and calculations
- Medical or legal information
- Product specifications and features
- Any domain where factual accuracy is critical
Custom
Create domain-specific evaluations for your unique business requirements.
What It Evaluates
Whatever you define in a custom criterion. Common use cases:
- Regulatory compliance checks
- Schema or format validation
- Latency or performance SLAs
- Business logic requirements
- Security and privacy rules
How It Works
You provide a criterion describing what to check. An LLM evaluates the response and returns:
- Binary pass/fail (1 or 0)
- Reasoning explaining the decision
Pass Criteria:
- Score = 1 (criterion met)
Example Criteria
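The criterion is plain text describing the rule to check. A few illustrative examples, loosely based on the use cases listed above (hypothetical, not from a specific deployment):
- “The response must not contain any personally identifiable information (PII).”
- “The response must include a disclaimer that it does not constitute financial advice.”
- “The output must be valid JSON that matches the documented order schema.”
- “The assistant must decline questions outside the scope of product support.”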
When to Use
- Industry-specific compliance requirements
- Custom business rules and workflows
- Structured output validation
- Security and privacy checks
- Chatbot safety and boundaries
- Any evaluation not covered by built-in types
Best Practices
Combining Evaluations
Use multiple evaluation types together for comprehensive quality assurance. The right combination depends on what your specific task does:
- Knowledge Base Q&A (RAG): Recall + Fact Checker + Naturalness
- Creative Content Generation: Naturalness + Style + Fact Checker (if accuracy matters)
- Retrieval-Based Customer Support: Recall + Naturalness + Style + Custom (compliance)
- Direct Response (no retrieval): Naturalness + Style + Custom (compliance)
- Chatbot with Boundaries: Naturalness + Custom (safety/boundaries) + Custom (compliance)
- Structured Output: Custom (format) + Custom (business logic)
Issue Creation
When an evaluation fails, Avido automatically creates an issue with:
- Title – Evaluation type and failure summary
- Priority – HIGH, MEDIUM, or LOW based on severity
- Description – Scores, reasoning, and context
- Trace Link – Direct access to the full conversation
Need Help?
- Email – support@avidoai.com