# Evaluations

> Built-in and custom evaluation types to measure AI quality, safety, and performance.

## Overview

Avido provides six evaluation types, each measuring a different aspect of AI quality. You apply an evaluation to one or more tasks; it then runs automatically every time those tasks execute and flags any response that falls below its pass threshold.

| Evaluation Type  | Purpose                          | Score Range | Pass Threshold |
| ---------------- | -------------------------------- | ----------- | -------------- |
| **Naturalness**  | Human-like communication quality | 1-5         | 3.5            |
| **Style**        | Brand guideline compliance       | 1-5         | 3.5            |
| **Recall**       | RAG pipeline performance         | 0-1         | 0.5            |
| **Fact Checker** | Factual accuracy vs ground truth | 0-1         | 0.8            |
| **Custom**       | Domain-specific criteria         | 0-1         | 0.5            |
| **Output Match** | Deterministic output validation  | 0-1         | 0.8            |

## Naturalness

Measures how natural, engaging, and clear your AI's responses are to users.

### What It Evaluates

The Naturalness evaluation assesses five dimensions of response quality:

* **Coherence** – Logical flow and consistency of ideas
* **Engagingness** – Ability to capture and maintain user interest
* **Naturalness** – Human-like language and tone
* **Relevance** – On-topic responses that address the user's intent
* **Clarity** – Clear, understandable language without ambiguity

### How It Works

An LLM evaluates your AI's response across all five dimensions on a 1-5 scale. The overall score is the average of these dimensions.

**Pass Criteria:**

* All five dimensions must score ≥ 3.5
* Average score ≥ 3.5

Because every dimension is checked individually, one weak dimension fails the evaluation even when the overall average is high.

### Example Results

| Coherence | Engagingness | Naturalness | Relevance | Clarity | Overall | Result                  |
| --------- | ------------ | ----------- | --------- | ------- | ------- | ----------------------- |
| 5         | 5            | 5           | 5         | 5       | 5.0     | ✅ Pass                  |
| 4         | 4            | 4           | 4         | 4       | 4.0     | ✅ Pass                  |
| 5         | 5            | 5           | 5         | 2       | 4.4     | ❌ Fail (Clarity \< 3.5) |
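
The pass rule behind these rows can be sketched in a few lines of Python. This is an illustration only, not Avido's implementation: the function name `naturalness_result` and the dictionary keys are hypothetical, and the sketch assumes the overall score is a plain arithmetic mean of the five dimension scores.

```python
from statistics import mean

# Hypothetical key names for the five dimensions described above.
DIMENSIONS = ("coherence", "engagingness", "naturalness", "relevance", "clarity")
PASS_THRESHOLD = 3.5


def naturalness_result(scores: dict[str, float]) -> tuple[float, bool, list[str]]:
    """Return (overall, passed, failing_dimensions) for one response."""
    overall = mean(scores[d] for d in DIMENSIONS)
    # Every dimension is checked on its own, not just the average.
    failing = [d for d in DIMENSIONS if scores[d] < PASS_THRESHOLD]
    passed = not failing and overall >= PASS_THRESHOLD
    return overall, passed, failing


# Third row of the table: a high average, but Clarity is below 3.5.
row = {"coherence": 5, "engagingness": 5, "naturalness": 5, "relevance": 5, "clarity": 2}
overall, passed, failing = naturalness_result(row)
print(f"{overall:.1f}", passed, failing)  # 4.4 False ['clarity']
```

Returning the list of failing dimensions alongside the boolean makes it easy to see which dimension caused a failure.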

### When to Use

* Conversational AI and chatbots
* Customer support automation
* Content generation systems
* Any user-facing AI interactions

## Style

Evaluates whether responses adhere to your organization's style guidelines and brand voice.

### What It Evaluates

A single comprehensive score (1-5) based on your custom style guide, measuring:

* Tone and voice consistency
* Terminology usage
* Format and structure requirements
* Brand-specific guidelines
* Reading level and complexity

### How It Works

You provide a style guide document that defines your brand's communication standards. An LLM evaluates each response against this guide and provides:

* A score from 1-5
* Detailed analysis explaining the rating

**Pass Criteria:**

* Score ≥ 3.5

### Example Style Guide Elements

```markdown
# Customer Support Style Guide

**Tone:** Professional yet friendly, never casual
**Voice:** Active voice preferred, clear and direct
**Terminology:** Use "account" not "profile", "transfer" not "send"
**Format:** Start with acknowledgment, provide solution, end with offer to help
**Constraints:** Keep responses under 100 words when possible
```

### When to Use

* Brand-critical communications
* Multi-channel consistency (chat, email, voice)
* Customer-facing applications where brand matters

Note: For regulated industries with strict compliance requirements, use Custom evaluations instead.

## Recall (RAG Evaluation)

Comprehensive evaluation of Retrieval-Augmented Generation (RAG) pipeline quality.

### What It Evaluates

Four metrics that measure different aspects of RAG performance:

* **Context Relevancy** – Are retrieved documents relevant to the query?
* **Context Precision** – How well-ranked are the retrieved documents?
* **Faithfulness** – Is the answer grounded in the retrieved context?
* **Answer Relevancy** – Does the answer address the user's question?

### How It Works

Each metric produces a score from 0-1 (higher is better). The overall score is the average of Context Precision, Faithfulness, and Answer Relevancy.

**Pass Criteria:**

* Context Precision ≥ 0.5
* Faithfulness ≥ 0.5
* Answer Relevancy ≥ 0.5

Note: Context Relevancy is computed for observability but doesn't affect pass/fail status.
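
As a rough sketch of how these rules combine, the Python below averages the three gated metrics and checks each one against the 0.5 threshold. The function name `recall_result` and the metric keys are hypothetical, chosen for illustration; Context Relevancy is accepted in the input but deliberately ignored, matching the note above.

```python
from statistics import mean

# Hypothetical key names; only these three metrics affect pass/fail.
GATED_METRICS = ("context_precision", "faithfulness", "answer_relevancy")
PASS_THRESHOLD = 0.5


def recall_result(metrics: dict[str, float]) -> tuple[float, bool]:
    """Return (overall, passed) for one RAG response."""
    overall = mean(metrics[m] for m in GATED_METRICS)
    # Each gated metric must clear the threshold on its own.
    passed = all(metrics[m] >= PASS_THRESHOLD for m in GATED_METRICS)
    return overall, passed


good = {"context_relevancy": 0.3, "context_precision": 0.9,
        "faithfulness": 0.8, "answer_relevancy": 0.7}
ungrounded = {"context_relevancy": 0.9, "context_precision": 1.0,
              "faithfulness": 0.4, "answer_relevancy": 1.0}

for metrics in (good, ungrounded):
    overall, passed = recall_result(metrics)
    print(f"{overall:.2f}", passed)
# 0.80 True
# 0.80 False
```

Both examples share the same overall score, yet the second fails because Faithfulness is below 0.5. The low Context Relevancy in the first example does not affect the result.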

### Score Interpretation

| Score Range | Interpretation        | Action Required         |
| ----------- | --------------------- | ----------------------- |
| 0.8 - 1.0   | Excellent performance | Monitor                 |
| 0.5 - \<0.8 | Acceptable quality    | Optimize if critical    |
| 0.0 - \<0.5 | Poor performance      | Investigate immediately |

### Common Issues and Solutions

| Low Metric        | Likely Cause                         | Solution                                       |
| ----------------- | ------------------------------------ | ---------------------------------------------- |
| Context Precision | Too many irrelevant chunks retrieved | Reduce top\_k, improve filters                 |
| Context Relevancy | Embedding/index drift                | Retrain embeddings, update index               |
| Faithfulness      | Model hallucinating                  | Add grounding instructions, reduce temperature |
| Answer Relevancy  | Answer drifts off-topic              | Improve prompt focus, add constraints          |

### When to Use

* Knowledge base search and retrieval
* Document Q\&A systems
* RAG pipelines
* Any system combining retrieval with generation

## Fact Checker

Validates factual accuracy of AI responses against ground truth.

### What It Evaluates

Compares AI-generated statements with known correct information, classifying each statement as:

* **True Positives (TP)** – Correct facts present in the response
* **False Positives (FP)** – Incorrect facts in the response
* **False Negatives (FN)** – Correct facts omitted from the response

### How It Works

An LLM extracts factual statements from both the AI response and ground truth, then classifies them. The F1 score measures accuracy:

```
F1 = TP / (TP + 0.5 × (FP + FN))
```

**Pass Criteria:**

* F1 score ≥ 0.8

This allows high-quality answers with minor omissions while maintaining strict accuracy standards.
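
For readers who want to check scores by hand, here is a minimal Python sketch of the formula and threshold above. The function names `fact_f1` and `fact_check_passes` are hypothetical, and returning 0.0 when there are no statements at all is an assumption rather than documented behavior.

```python
def fact_f1(tp: int, fp: int, fn: int) -> float:
    """F1 = TP / (TP + 0.5 * (FP + FN)), as defined above."""
    denominator = tp + 0.5 * (fp + fn)
    # Assumption: with no statements to classify, score the response as 0.0.
    return tp / denominator if denominator else 0.0


def fact_check_passes(tp: int, fp: int, fn: int, threshold: float = 0.8) -> bool:
    """Apply the documented pass criterion (F1 >= 0.8)."""
    return fact_f1(tp, fp, fn) >= threshold


print(round(fact_f1(5, 0, 1), 2), fact_check_passes(5, 0, 1))  # 0.91 True
print(round(fact_f1(3, 0, 2), 2), fact_check_passes(3, 0, 2))  # 0.75 False
```

Because false positives and false negatives carry the same weight in the denominator, one wrong fact costs exactly as much as one omitted fact.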

### Example Classification

**Question:** "What powers the sun?"

**Ground Truth:** "The sun is powered by nuclear fusion. In its core, hydrogen atoms fuse to form helium, releasing tremendous energy."

**AI Response:** "The sun is powered by nuclear fission, similar to nuclear reactors, and provides light to the solar system."

**Classification:**

* TP: \["Provides light to the solar system"]
* FP: \["Powered by nuclear fission", "Similar to nuclear reactors"]
* FN: \["Powered by nuclear fusion", "Hydrogen fuses to form helium"]
* F1 Score: 1 / (1 + 0.5 × (2 + 2)) = 0.33 → ❌ Fail

### Score Examples

| TP | FP | FN | F1 Score | Result | Notes                     |
| -- | -- | -- | -------- | ------ | ------------------------- |
| 5  | 0  | 0  | 1.0      | ✅ Pass | Perfect accuracy          |
| 5  | 0  | 1  | 0.91     | ✅ Pass | Minor omission acceptable |
| 5  | 1  | 0  | 0.91     | ✅ Pass | Minor error acceptable    |
| 4  | 1  | 1  | 0.8      | ✅ Pass | Boundary case             |
| 3  | 0  | 2  | 0.75     | ❌ Fail | Too many omissions        |
| 1  | 4  | 0  | 0.33     | ❌ Fail | Mostly incorrect          |

### When to Use

* Financial data and calculations
* Medical or legal information
* Product specifications and features
* Any domain where factual accuracy is critical

## Custom

Create domain-specific evaluations for your unique business requirements.

### What It Evaluates

Whatever you define in a custom criterion. Common use cases:

* Regulatory compliance checks
* Schema or format validation
* Latency or performance SLAs
* Business logic requirements
* Security and privacy rules

### How It Works

You provide a criterion describing what to check. An LLM evaluates the response and returns:

* Binary pass/fail (1 or 0)
* Reasoning explaining the decision

**Pass Criteria:**

* Score = 1 (criterion met)

### Example Criteria

```markdown
# Compliance Example
"The response must not mention specific account numbers, 
social security numbers, or other PII. Pass if no PII is present."

# Format Example
"The response must be formatted as a JSON object with 
'action', 'parameters', and 'reasoning' keys. Pass if valid JSON 
with all required keys."

# Business Logic Example
"For loan inquiries, the response must ask for income verification 
before discussing loan amounts. Pass if verification is requested first."

# Chatbot Boundaries Example
"When asked to perform actions outside the chatbot's scope (e.g., 
processing refunds, accessing user accounts, making reservations), 
the response must politely decline and explain limitations. Pass if 
the chatbot appropriately refuses and provides alternative guidance."
```

### When to Use

* Industry-specific compliance requirements
* Custom business rules and workflows
* Structured output validation
* Security and privacy checks
* Chatbot safety and boundaries
* Any evaluation not covered by built-in types

## Output Match

Deterministic validation of AI outputs against expected values, without using an LLM judge.

### What It Evaluates

Compares your AI's actual output against an expected value you define per task. Unlike the other evaluations, which rely on LLM judgment, Output Match compares values directly, so the same output always produces the same result.

Two comparison modes are available:

* **String mode** – Exact string match between output and expected value
* **List mode** – Compare lists of values with flexible matching strategies

### How It Works

#### String Mode

The AI's response is compared directly against the expected string. Optionally, a regex extraction pattern can be applied first to pull a specific value from the response before comparison.

**Pass Criteria:**

* Extracted (or full) output exactly matches the expected string
* Score = 1 (match) or 0 (mismatch)

**Example:**

If your AI returns `"The order status is: SHIPPED"` and you configure:

* Extract pattern: `status is: (\w+)` (capture group 1)
* Expected: `SHIPPED`

The evaluation extracts `SHIPPED` from the response and compares it to the expected value → ✅ Pass.
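
The extract-then-compare flow can be approximated with Python's standard `re` module. This is a sketch under stated assumptions, not Avido's implementation: `string_match` is a hypothetical name, the sketch assumes Python-style regex semantics, and it assumes that a pattern which finds nothing counts as a mismatch.

```python
import re
from typing import Optional


def string_match(output: str, expected: str, extract: Optional[str] = None) -> int:
    """Return 1 if the (optionally extracted) output equals expected, else 0."""
    candidate = output
    if extract is not None:
        match = re.search(extract, output)
        if match is None:
            # Assumption: a pattern that finds nothing is treated as a mismatch.
            return 0
        # Use capture group 1 when the pattern defines one, else the whole match.
        candidate = match.group(1) if match.groups() else match.group(0)
    # Exact, case-sensitive comparison.
    return 1 if candidate == expected else 0


print(string_match("The order status is: SHIPPED", "SHIPPED", r"status is: (\w+)"))  # 1
print(string_match("shipped", "SHIPPED"))  # 0
```

The second call shows that the comparison is case-sensitive: `shipped` does not match `SHIPPED`.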

#### List Mode

The AI's response is parsed as a list and compared against an expected list of values. Two matching strategies are available:

**Exact Unordered** – Both lists must contain exactly the same items (order doesn't matter).

* Score = 1 (exact match) or 0 (mismatch)

**Contains** – Measures overlap between the output and expected lists using a configurable metric:

* **Precision** – What fraction of the output items are correct?
* **Recall** – What fraction of the expected items are present?
* **F1** – Harmonic mean of precision and recall

**Pass Criteria:**

* Score ≥ 0.8 (default, configurable per evaluation)
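
The two list strategies can be sketched in Python as follows. The function names are hypothetical, and the sketch makes two assumptions that the text above does not settle: duplicate items are ignored (lists are treated as sets), and items are compared by exact equality.

```python
def exact_unordered(output: list[str], expected: list[str]) -> int:
    """Return 1 if both lists hold the same items in any order, else 0."""
    # Assumption: duplicates are ignored, so lists are compared as sets.
    return 1 if set(output) == set(expected) else 0


def contains_score(output: list[str], expected: list[str], metric: str = "recall") -> float:
    """Score the overlap between output and expected using the chosen metric."""
    out_items, exp_items = set(output), set(expected)
    hits = len(out_items & exp_items)
    if metric == "precision":
        return hits / len(out_items) if out_items else 0.0
    if metric == "recall":
        return hits / len(exp_items) if exp_items else 0.0
    if metric == "f1":
        # Algebraically equal to the harmonic mean of precision and recall.
        total = len(out_items) + len(exp_items)
        return 2 * hits / total if total else 0.0
    raise ValueError(f"unknown metric: {metric}")


expected = ["a", "b", "c"]
print(exact_unordered(["c", "a", "b"], expected))                  # 1
print(contains_score(["a", "b"], expected, "f1"))                  # 0.8
print(round(contains_score(["a", "b", "d"], expected, "f1"), 2))   # 0.67
```

With the default 0.8 threshold, the second call passes and the third fails. Computing F1 from raw counts, instead of from already-rounded precision and recall values, avoids floating-point drift at the threshold boundary.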

### Configuration

| Setting            | Applies To      | Description                                                         |
| ------------------ | --------------- | ------------------------------------------------------------------- |
| **Type**           | All             | `string` or `list` — determines comparison mode                     |
| **Expected**       | Per task        | The expected output value (string) or values (list)                 |
| **Match Mode**     | List only       | `exact_unordered` or `contains`                                     |
| **Score Metric**   | List (contains) | `precision`, `recall`, or `f1` (default: `recall`)                  |
| **Pass Threshold** | List only       | Override the default 0.8 threshold (0-1)                            |
| **Extract**        | Optional        | Regex pattern to extract value(s) from the output before comparison |

### Score Examples

#### String Mode

| Output    | Expected  | Result                              |
| --------- | --------- | ----------------------------------- |
| `SHIPPED` | `SHIPPED` | ✅ Pass (score: 1.0)                 |
| `shipped` | `SHIPPED` | ❌ Fail (score: 0.0, case-sensitive) |
| `PENDING` | `SHIPPED` | ❌ Fail (score: 0.0)                 |

#### List Mode (Contains, F1)

| Output            | Expected          | Precision | Recall | F1   | Result |
| ----------------- | ----------------- | --------- | ------ | ---- | ------ |
| `["a", "b", "c"]` | `["a", "b", "c"]` | 1.0       | 1.0    | 1.0  | ✅ Pass |
| `["a", "b"]`      | `["a", "b", "c"]` | 1.0       | 0.67   | 0.8  | ✅ Pass |
| `["a", "b", "d"]` | `["a", "b", "c"]` | 0.67      | 0.67   | 0.67 | ❌ Fail |
| `["a"]`           | `["a", "b", "c"]` | 1.0       | 0.33   | 0.5  | ❌ Fail |

### When to Use

* Structured output validation (JSON fields, status codes, categories)
* Classification tasks with known correct answers
* Extraction pipelines where output must match expected values
* Regression testing with deterministic expected outputs
* Any task where you need exact, reproducible pass/fail without LLM judgment

## Best Practices

### Combining Evaluations

Use multiple evaluation types together for comprehensive quality assurance. The right combination depends on what your specific task does:

* **Knowledge Base Q\&A (RAG):** Recall + Fact Checker + Naturalness
* **Creative Content Generation:** Naturalness + Style + Fact Checker (if accuracy matters)
* **Retrieval-Based Customer Support:** Recall + Naturalness + Style + Custom (compliance)
* **Direct Response (no retrieval):** Naturalness + Style + Custom (compliance)
* **Chatbot with Boundaries:** Naturalness + Custom (safety/boundaries) + Custom (compliance)
* **Structured Output:** Output Match + Custom (business logic)
* **Classification / Extraction:** Output Match + Naturalness (if user-facing)

Choose evaluations based on your task's behavior, not just your application type. For example, a customer support application might use different evaluation combinations for retrieval-based responses versus direct answers, and might add Custom evaluations to ensure the chatbot properly refuses out-of-scope requests.

## Issue Creation

When an evaluation fails, Avido automatically creates an issue with:

* **Title** – Evaluation type and failure summary
* **Priority** – HIGH, MEDIUM, or LOW based on severity
* **Description** – Scores, reasoning, and context
* **Trace Link** – Direct access to the full conversation

All issues appear in your [Inbox](/inbox) for triage and resolution.

## Need Help?

* **Email** – [support@avidoai.com](mailto:support@avidoai.com)

For API details and integration guides, see the [API Reference](/api-reference).