Benchmarks
Benchmarking is the process of evaluating the performance of a system or component against a set of predefined metrics. In ToothFairyAI, benchmarks store predefined sets of questions, answers, reasoning snippets, and context used to rate and verify the quality of generated responses. Unlike most platforms, ToothFairyAI benchmarks are agentic in nature: rather than evaluating the quality of a model's responses, they evaluate the quality of the agent's responses. This includes scenarios where the agent must generate a response based on a given set of documents (RAG) or use a series of tools to accomplish a task.
Benchmark Types
ToothFairyAI supports two types of benchmarks:
| Type | Description | Use Case |
|---|---|---|
| Standard BM | Evaluates agent responses against expected answers | Quality assurance, performance testing |
| Antagonistic BM (ABM) | Tests agent consistency over multi-turn conversations | Consistency testing, reliability verification |
Standard BM
Standard benchmarks evaluate an agent's responses against predefined correct answers. The reviewer model compares the agent's output with the expected answer, reasoning, and context to generate a quality score.
Antagonistic BM (ABM)
Antagonistic benchmarks feature a multi-turn conversation where two agents interact in a "battle" format to test the consistency and robustness of the Benchmarked Agent. The system uses three key components:
| Role | Description |
|---|---|
| Benchmarked Agent | The agent being evaluated - responds to questions and challenges |
| Examiner Agent | The tester agent - questions and probes the benchmarked agent |
| Reviewer Model | An independent model that scores the quality of responses from both agents |
ABM operates in two modes:
| Mode | Description | Question Requirement | Max Turns |
|---|---|---|---|
| Static | Examiner follows a fully scripted set of questions from the benchmark | 1-200 questions | Up to 50 turns |
| Dynamic | Examiner generates probing questions dynamically based on an initial seed question | 1 seed question minimum | Up to 50 turns |
How ABM Works
- Agent Selection: Select two agents - one as the Benchmarked Agent and one as the Examiner Agent
- Conversation Flow: The Examiner Agent engages the Benchmarked Agent in a multi-turn conversation
- Response Scoring: The Reviewer Model evaluates each response from the Benchmarked Agent
- Quality Assessment: Scores are calculated based on correctness, reasoning, and context adherence
- Turn Limit: Conversations can run up to 50 turns for comprehensive testing
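The steps above can be sketched as a simple loop. This is an illustrative sketch only, not ToothFairyAI's actual API: `examiner`, `benchmarked`, and `reviewer` are hypothetical callables standing in for the three roles, and the Dynamic-mode behavior of generating the next probing question from the conversation history is assumed.

```python
MAX_TURNS = 50  # upper limit on conversation length, per the ABM modes table

def run_abm(examiner, benchmarked, reviewer, seed_question, max_turns=MAX_TURNS):
    """Run a multi-turn antagonistic benchmark and collect per-turn scores."""
    history = []   # (question, answer) pairs accumulated as conversation context
    scores = []
    question = seed_question
    for _ in range(max_turns):
        answer = benchmarked(question, history)    # Benchmarked Agent responds
        scores.append(reviewer(question, answer))  # Reviewer Model scores the turn
        history.append((question, answer))
        question = examiner(history)               # Examiner probes further (Dynamic mode)
    return scores
```

In Static mode, `examiner` would simply pop the next scripted question from the benchmark instead of generating one from the history.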
ABM Use Cases
- Consistency Testing: Verify agents maintain coherent responses across extended conversations
- Robustness Validation: Test how well agents handle challenging or adversarial questions
- Edge Case Discovery: Identify weaknesses in agent responses through probing questions
- Performance Benchmarking: Compare agent performance against standardized adversarial tests
- Static Mode: Best for reproducible tests with known question sequences - ideal for regression testing
- Dynamic Mode: Best for discovering unknown weaknesses through AI-generated probing questions
Create a benchmark
- Click on the Create button.
- Assign a name and a description (optional) to the benchmark for easy identification.
- Upload the benchmark CSV file. The file should contain the following columns:
File size limits for benchmark CSV files:
- All subscriptions: Maximum 2MB per CSV file (fixed limit for benchmark files)
Question count requirements:
| Benchmark Type | Minimum | Recommended | Maximum |
|---|---|---|---|
| Standard BM | 1 | - | - |
| ABM Static | 1 | 10+ | 200 |
| ABM Dynamic | 1 | 1 | - |
| question | answer | reasoning | context |
|---|---|---|---|
| Question 1 | Answer 1 | Reasoning 1 | Context 1 |
| Question 2 | Answer 2 | Reasoning 2 | Context 2 |
| Question 3 | Answer 3 | Reasoning 3 | Context 3 |
For best compatibility of the CSV files you upload, it is recommended to use text qualification with double-quote notation.
For example:
"What is the capital of France?";"Paris";"The capital of France is Paris.";"Let's talk about geography."
"What is the capital of Italy?";"Rome";"The capital of Italy is Rome.";"Let's talk about best countries in the world."
When uploading a CSV file, it is important to ensure that the file is correctly formatted. The file must not contain a header row with the column names. Each field should be wrapped in double quotes and separated by semicolons.
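A file in this format (semicolon-separated, every field double-quoted, no header row) can be produced with Python's standard `csv` module. This is a sketch using the example rows from this page; the filename `benchmark.csv` is just an illustration:

```python
import csv

# Each row: question, answer, reasoning, context (2-4 fields per row).
rows = [
    ("What is the capital of France?", "Paris",
     "The capital of France is Paris.", "Let's talk about geography."),
    ("What is the capital of Italy?", "Rome",
     "The capital of Italy is Rome.", "Let's talk about geography."),
]

# QUOTE_ALL wraps every field in double quotes; delimiter=";" matches the
# semicolon-separated format described above. No header row is written.
with open("benchmark.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
```

`newline=""` is the documented way to open CSV files for writing in Python, and UTF-8 matches the encoding recommendation in the validation rules below.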
Upon uploading the file, the system will automatically parse the file and display the contents in the preview section, specifically the first 10 rows. The system will also show a Benchmark Compatibility indicator showing which benchmark types are supported.
If the file is not correctly formatted, an error message will be displayed.
- Click on the Save button to create the benchmark.
Benchmarks are available only for Pro, Business and Enterprise plans. Please contact us for more information to start creating benchmarks.
Benchmark Compatibility Indicators
After uploading a benchmark file, the system displays compatibility indicators:
| Indicator | Meaning |
|---|---|
| ✓ Standard BM | File has at least 1 question - compatible with Standard BM |
| ✗ Standard BM | No questions - not compatible |
| ✓ ABM Static | File has 10-200 questions - compatible with ABM Static mode |
| ✗ ABM Static | Fewer than 10 questions - not recommended for ABM Static |
| ⚠ ABM Static | More than 200 questions - exceeds maximum limit |
| ✓ ABM Dynamic | File has at least 1 question - compatible with ABM Dynamic mode |
| ✗ ABM Dynamic | No questions - not compatible |
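The indicator table above can be summarized as a small mapping from question count to indicators. This is a sketch of the thresholds as documented here, not the platform's actual implementation:

```python
def compatibility(n_questions: int) -> dict:
    """Map a parsed question count to the compatibility indicators."""
    # ABM Static: ✓ within 10-200, ⚠ above the 200 maximum, ✗ below 10.
    static = "⚠" if n_questions > 200 else ("✓" if n_questions >= 10 else "✗")
    return {
        "Standard BM": "✓" if n_questions >= 1 else "✗",
        "ABM Static": static,
        "ABM Dynamic": "✓" if n_questions >= 1 else "✗",
    }
```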
Benchmark details
- Question: The question that the agent is required to answer. This field mimics the user input.
- Answer: The expected answer that the agent should generate. This field is used to evaluate the quality of the response (primarily used in Standard BM).
- Reasoning: The reasoning behind the answer. This field is used to evaluate the quality of the response.
- Context: The context in which the question is asked. This field mimics the context in which the user is asking the question. For example the summary of previous turns in a conversation.
Validation Rules
The following validation rules apply when uploading benchmark CSV files:
Field Validation
- Each field must be wrapped in double quotes
- Fields are separated by semicolons (;)
- Maximum field length: 6,000 characters per field
- Minimum fields required: 2 (question and answer)
- Maximum fields allowed: 4 (question, answer, reasoning, context)
File Validation
- Maximum file size: 2MB
- No header row required
- UTF-8 encoding recommended
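The field and file rules above could be checked client-side before upload. The following is a sketch only, based on the limits documented on this page; ToothFairyAI's own parser may differ in details:

```python
import csv
import io

MAX_FIELD_LEN = 6000              # maximum characters per field
MAX_FILE_BYTES = 2 * 1024 * 1024  # 2MB file size limit

def validate_benchmark(raw: bytes) -> list:
    """Return a list of error messages for a benchmark CSV file."""
    errors = []
    if len(raw) > MAX_FILE_BYTES:
        errors.append("File exceeds 2MB")
    reader = csv.reader(io.StringIO(raw.decode("utf-8")), delimiter=";")
    for i, row in enumerate(reader, start=1):
        if not 2 <= len(row) <= 4:                  # question+answer required,
            errors.append(f"Line {i}: invalid number of fields")  # reasoning/context optional
            continue
        if not row[0].strip():
            errors.append(f"Line {i}: question field is required")
        if not row[1].strip():
            errors.append(f"Line {i}: answer field is required")
        if any(len(field) > MAX_FIELD_LEN for field in row):
            errors.append(f"Line {i}: field exceeds 6000 characters")
    return errors
```

An empty list means the file passed every documented rule; each error message mirrors one row of the error table below.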
Error Messages
| Error | Cause | Solution |
|---|---|---|
| "Invalid number of fields" | Line has fewer than 2 or more than 4 fields | Ensure each line has 2-4 semicolon-separated fields |
| "Question field is required" | Question field is empty | Provide a question for each row |
| "Answer field is required" | Answer field is empty | Provide an answer for each row |
| "Question exceeds 6000 characters" | Question too long | Shorten the question text |
| "File parsing error" | Malformed CSV structure | Check file formatting and encoding |
Orchestrator agents can now be benchmarked just like standard agents, enabling comprehensive performance testing of complex workflows. Desktop agents are not yet supported for benchmarking.
Orchestrator Benchmarking
Orchestrator agents can be benchmarked to evaluate workflow performance and coordination capabilities:
Key Capabilities:
- Workflow Performance Testing - Measure how well orchestrators accomplish specific workflows
- Condition-Based Evaluation - Test orchestrator performance under various conditions and scenarios
- Standard Benchmarking Interface - Same familiar benchmarking tools you use for regular agents
- Parallel Task Assessment - Evaluate how orchestrators handle multiple concurrent operations
- Reliability Metrics - Track success rates, execution times, and resource efficiency
Benchmarking Orchestrators:
- Create a benchmark with questions that require multi-step orchestration
- Assign the orchestrator agent as the benchmarked agent
- Run the benchmark to evaluate planning, execution, and coordination performance
When benchmarking orchestrators, include scenarios that test:
- Multi-agent coordination
- Dynamic plan adjustment
- Error recovery and re-planning
- Resource allocation decisions