Benchmarks

Benchmarking is the process of evaluating the performance of a system or component against a set of predefined metrics. In ToothFairyAI, benchmarks store predefined sets of questions, answers, reasoning snippets, and context used to rate and verify the quality of generated responses. Unlike other platforms, ToothFairyAI benchmarks are agentic in nature: rather than rating the responses of a model in isolation, they evaluate the responses of an agent. This includes scenarios where the agent must generate a response based on a given set of documents (RAG) or must use a series of tools to accomplish a task.

Benchmark Types

ToothFairyAI supports two types of benchmarks:

| Type | Description | Use Case |
|---|---|---|
| Standard BM | Evaluates agent responses against expected answers | Quality assurance, performance testing |
| Antagonistic BM (ABM) | Tests agent consistency over multi-turn conversations | Consistency testing, reliability verification |

Standard BM

Standard benchmarks evaluate an agent's responses against predefined correct answers. The reviewer model compares the agent's output with the expected answer, reasoning, and context to generate a quality score.

Antagonistic BM (ABM)

Antagonistic benchmarks feature a multi-turn conversation where two agents interact in a "battle" format to test the consistency and robustness of the Benchmarked Agent. The system uses three key components:

| Role | Description |
|---|---|
| Benchmarked Agent | The agent being evaluated; responds to questions and challenges |
| Examiner Agent | The tester agent; questions and probes the benchmarked agent |
| Reviewer Model | An independent model that scores the quality of responses from both agents |

ABM operates in two modes:

| Mode | Description | Question Requirement | Max Turns |
|---|---|---|---|
| Static | Examiner follows a fully scripted set of questions from the benchmark | 1-200 questions | Up to 50 turns |
| Dynamic | Examiner generates probing questions dynamically based on an initial seed question | 1 seed question minimum | Up to 50 turns |

How ABM Works

  1. Agent Selection: Select two agents - one as the Benchmarked Agent and one as the Examiner Agent
  2. Conversation Flow: The Examiner Agent engages the Benchmarked Agent in a multi-turn conversation
  3. Response Scoring: The Reviewer Model evaluates each response from the Benchmarked Agent
  4. Quality Assessment: Scores are calculated based on correctness, reasoning, and context adherence
  5. Turn Limit: Conversations can run up to 50 turns for comprehensive testing
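The flow above can be sketched as a simple loop. All names here (`run_abm`, the agent callables) are hypothetical stand-ins, not ToothFairyAI APIs; the point is the control flow: the examiner asks, the benchmarked agent answers, the reviewer scores, up to the turn limit.

```python
# Minimal sketch of the ABM loop, assuming agents and the reviewer are
# plain callables. These names are illustrative, not platform APIs.

def run_abm(benchmarked, examiner, reviewer, seed_question, max_turns=50):
    """Run an antagonistic benchmark and return the average response score."""
    question = seed_question
    scores = []
    for _ in range(max_turns):
        answer = benchmarked(question)              # Benchmarked Agent responds
        scores.append(reviewer(question, answer))   # Reviewer Model scores the turn
        question = examiner(question, answer)       # Examiner produces the next probe
    return sum(scores) / len(scores)

# Toy stand-ins to show the flow:
avg = run_abm(
    benchmarked=lambda q: f"answer to: {q}",
    examiner=lambda q, a: f"follow-up on: {a}",
    reviewer=lambda q, a: 1.0 if a else 0.0,
    seed_question="What is the capital of France?",
    max_turns=5,
)
```

In a real run the reviewer would produce a graded score per turn rather than a binary one; the aggregation shown here is a simple mean.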

ABM Use Cases

  • Consistency Testing: Verify agents maintain coherent responses across extended conversations
  • Robustness Validation: Test how well agents handle challenging or adversarial questions
  • Edge Case Discovery: Identify weaknesses in agent responses through probing questions
  • Performance Benchmarking: Compare agent performance against standardized adversarial tests

Choosing Between Static and Dynamic

  • Static Mode: Best for reproducible tests with known question sequences - ideal for regression testing
  • Dynamic Mode: Best for discovering unknown weaknesses through AI-generated probing questions

Create a benchmark

  1. Click on the Create button.
  2. Assign a name and a description (optional) to the benchmark for easy identification.
  3. Upload the benchmark CSV file. The file should contain the columns described below:

File size limits for benchmark CSV files:

  • All subscriptions: Maximum 2MB per CSV file (fixed limit for benchmark files)

Question count requirements:

| Benchmark Type | Minimum | Recommended | Maximum |
|---|---|---|---|
| Standard BM | 1 | - | - |
| ABM Static | 1 | 10+ | 200 |
| ABM Dynamic | 1 | 1 | - |
| question | answer | reasoning | context |
|---|---|---|---|
| Question 1 | Answer 1 | Reasoning 1 | Context 1 |
| Question 2 | Answer 2 | Reasoning 2 | Context 2 |
| Question 3 | Answer 3 | Reasoning 3 | Context 3 |

For best compatibility, it is recommended to upload CSV files that use double quotes as the text qualifier.

For example:

"What is the capital of France?";"Paris";"The capital of France is Paris.";"Let's talk about geography."
"What is the capital of Italy?";"Rome";"The capital of Italy is Rome.";"Let's talk about best countries in the world."

How to structure the CSV files

When uploading a CSV file, it is important to ensure that the file is correctly formatted. The file must not contain a header row with the column names. Each field should be wrapped in double quotes and separated by semicolons.

Upon upload, the system automatically parses the file and displays the first 10 rows in the preview section. It also shows a Benchmark Compatibility indicator listing which benchmark types the file supports.

In case the file is not correctly formatted, an error message will be displayed.
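As an illustration, Python's standard `csv` module can produce this exact format: semicolon delimiter, every field double-quoted, no header row. The rows are the geography examples from above; nothing here is a ToothFairyAI API.

```python
# Sketch: write a benchmark CSV in the required format
# (semicolon-delimited, all fields double-quoted, no header row).
import csv
import io

rows = [
    ("What is the capital of France?", "Paris",
     "The capital of France is Paris.", "Let's talk about geography."),
    ("What is the capital of Italy?", "Rome",
     "The capital of Italy is Rome.", "Let's talk about best countries in the world."),
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter=";", quoting=csv.QUOTE_ALL)
writer.writerows(rows)  # no header row, per the format rules
print(buf.getvalue())
```

Writing to a real file works the same way with `open("benchmark.csv", "w", newline="")` in place of the `StringIO` buffer.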

  4. Click on the Save button to create the benchmark.

Benchmark availability

Benchmarks are available only on the Pro, Business, and Enterprise plans. Please contact us for more information to start creating benchmarks.

Benchmark Compatibility Indicators

After uploading a benchmark file, the system displays compatibility indicators:

| Indicator | Condition | Meaning |
|---|---|---|
| Standard BM | At least 1 question | Compatible with Standard BM |
| Standard BM | No questions | Not compatible |
| ABM Static | 10-200 questions | Compatible with ABM Static mode |
| ABM Static | Fewer than 10 questions | Not recommended for ABM Static |
| ABM Static | More than 200 questions | Exceeds maximum limit |
| ABM Dynamic | At least 1 question | Compatible with ABM Dynamic mode |
| ABM Dynamic | No questions | Not compatible |
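The indicator logic above reduces to a question count check. A minimal sketch (the function name is ours, not part of the platform):

```python
# Sketch of the compatibility logic implied by the indicator table,
# driven purely by the number of questions in the uploaded file.
def benchmark_compatibility(n_questions: int) -> dict:
    return {
        "standard_bm": n_questions >= 1,
        "abm_static": 10 <= n_questions <= 200,  # 10+ recommended, 200 is the hard cap
        "abm_dynamic": n_questions >= 1,         # one seed question is enough
    }
```

For example, a file with 5 questions is compatible with Standard BM and ABM Dynamic but falls below the recommended range for ABM Static.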

Benchmark details

  • Question: The question that the agent is required to answer. This field mimics the user input.
  • Answer: The expected answer that the agent should generate. This field is used to evaluate the quality of the response (primarily used in Standard BM).
  • Reasoning: The reasoning behind the answer. This field is used to evaluate the quality of the response.
  • Context: The context in which the question is asked. This field mimics the context in which the user is asking the question, for example a summary of previous turns in a conversation.

Validation Rules

The following validation rules apply when uploading benchmark CSV files:

Field Validation

  • Each field must be wrapped in double quotes
  • Fields are separated by semicolons (;)
  • Maximum field length: 6,000 characters per field
  • Minimum fields required: 2 (question and answer)
  • Maximum fields allowed: 4 (question, answer, reasoning, context)

File Validation

  • Maximum file size: 2MB
  • No header row required
  • UTF-8 encoding recommended

Error Messages

| Error | Cause | Solution |
|---|---|---|
| "Invalid number of fields" | Line has fewer than 2 or more than 4 fields | Ensure each line has 2-4 semicolon-separated fields |
| "Question field is required" | Question field is empty | Provide a question for each row |
| "Answer field is required" | Answer field is empty | Provide an answer for each row |
| "Question exceeds 6000 characters" | Question too long | Shorten the question text |
| "File parsing error" | Malformed CSV structure | Check file formatting and encoding |

Orchestrator and Desktop agents

Orchestrator agents can now be benchmarked just like standard agents, enabling comprehensive performance testing of complex workflows. Desktop agents are not yet supported for benchmarking.

Orchestrator Benchmarking

Orchestrator agents can be benchmarked to evaluate workflow performance and coordination capabilities:

Key Capabilities:

  • Workflow Performance Testing - Measure how well orchestrators accomplish specific workflows
  • Condition-Based Evaluation - Test orchestrator performance under various conditions and scenarios
  • Standard Benchmarking Interface - Same familiar benchmarking tools you use for regular agents
  • Parallel Task Assessment - Evaluate how orchestrators handle multiple concurrent operations
  • Reliability Metrics - Track success rates, execution times, and resource efficiency

Benchmarking Orchestrators:

  1. Create a benchmark with questions that require multi-step orchestration
  2. Assign the orchestrator agent as the benchmarked agent
  3. Run the benchmark to evaluate planning, execution, and coordination performance

Best Practices for Orchestrator Benchmarks

When benchmarking orchestrators, include scenarios that test:

  • Multi-agent coordination
  • Dynamic plan adjustment
  • Error recovery and re-planning
  • Resource allocation decisions