Private Benchmarks
ToothFairyAI allows organisations to create internal private benchmarks to evaluate the quality of the responses generated by AI agents within the platform. Benchmarks store predefined sets of questions, answers, reasoning snippets, and context used to rate and verify the quality of the generated responses.
Menu location
Benchmarks can be accessed from the following menu:
Advanced > Benchmarks
The benchmarks section consists of the following tabs:
- Agents: This tab allows the user to run an already created benchmark against a specific agent
- Planners (coming soon): This tab will allow the user to run an already created benchmark against a specific planner agent.
Run benchmark
When the user clicks on the Run benchmark button, the system will display a modal with the following options:
- Select an agent: The user can select the agent against which the benchmark will be run
- Name of the run: The user can assign a name to the run for easy identification; this name is concatenated with the name of the benchmark upon creation
- Description: The user can add a description to the run for easy identification and reference (optional)
- Select a benchmark: The user can select the benchmark that will be run against the agent. See here how to create your first private benchmark dataset.
- Select a reviewing model: The user can select the reviewing model that will be used to evaluate the responses generated by the agent. At the moment, we only support a subset of the available models for the reviewing process. The user can select one of the following models:
  - TF Sorcerer: The default model used by any agent fine-tuned by ToothFairyAI.
  - TF Sorcerer 1.5 (experimental): A quantum leap in the capabilities of the Sorcerer model, offering enhanced performance and accuracy. Request access to this model from the support team.
  - TF Mystica: A larger model, more powerful and more accurate, however slower.
  - Llama 3.1 8b - FP8: The latest model created by Meta AI in its smallest version, with a larger context window.
  - Llama 3.1 70b - FP8: The legacy model created by Meta AI in its largest version, with a larger context window.
  - Llama 3.1 405b - FP8: The largest open-sourced, state-of-the-art model provided by Meta.
  - Llama 3.1 Nemotron 70b - FP16: The latest fine-tuned model created by Nvidia, using the 70b Llama 3.1 model as the base model.
  - Llama 3.3 70b - FP8: The latest open-sourced model created by Meta, with a larger context window.
- Evaluation criteria (the sum of the weights is automatically forced to equal 100%):
  - Correctness: How much weight should be given to the correctness of the response (default 80%)
  - Reasoning: How much weight should be given to the accurate and correct reasoning behind the response (default 15%)
  - Context: How much weight should be given to the adherence to the context of the question (default 5%)
- Passing Threshold score: The user can set a threshold score that the agent must achieve to pass the benchmark. The threshold score is a percentage value between 0 and 100. The default value is 90%.
The scoring of the responses is calculated by the reviewing model selected by the user. The reviewing model will evaluate the responses generated by the agent against the expected answers and the correct reasoning (if provided) in the benchmark and provide a score based on the evaluation criteria set by the user.
ToothFairyAI automatically sets the temperature of the reviewing model to 0 to ensure the most consistent results across different runs.
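To make the weighting concrete, the sketch below shows how a weighted score and pass/fail decision could be derived from per-criterion scores. It is a minimal illustration assuming per-criterion scores in the 0-100 range; the function and variable names are not part of the ToothFairyAI platform.

```python
# Illustrative sketch only: ToothFairyAI does not expose its internal scoring code.
# Assumes the reviewing model returns per-criterion scores in the 0-100 range.

def weighted_score(correctness: float, reasoning: float, context: float,
                   weights=(0.80, 0.15, 0.05)) -> float:
    """Combine per-criterion scores using the default 80/15/5 weights."""
    w_correct, w_reason, w_context = weights
    assert abs(w_correct + w_reason + w_context - 1.0) < 1e-9  # weights must sum to 100%
    return correctness * w_correct + reasoning * w_reason + context * w_context


def passes(score: float, threshold: float = 90.0) -> bool:
    """Compare the weighted score against the passing threshold (default 90%)."""
    return score >= threshold


# Example: a response scored 95 on correctness, 80 on reasoning, 70 on context.
score = weighted_score(correctness=95, reasoning=80, context=70)
print(round(score, 1), passes(score))  # 91.5 True
```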
Upon clicking Confirm, the benchmark run will be put in a queue for processing. Depending on the size of the benchmark and the number of concurrent runs, the processing time may vary. The user can check the status of the run in the Benchmarks section.
The status of the run can be one of the following:
- Queued: The run is waiting to be processed.
- In Progress: The run is currently being processed.
- Completed: The run has been processed and the results are available.
- Failed: The run has failed and the results are not available.
The number of concurrent runs that can be processed at the same time is limited. By default, all workspaces can process only one run at a time. If you need to run multiple benchmarks at the same time, please contact the support team.
Benchmark results
Once the benchmark run is completed, the user can view the results in the Benchmarks section. The results will include the following information:
- Benchmark Name: The name of the benchmark.
- Benchmark Description: The description of the benchmark.
- Benchmark Status: The status of the benchmark.
- Run configuration: The configuration of the benchmark run. This includes the agent, reviewing model, evaluation criteria, and passing threshold score in JSON format (an illustrative example follows this list).
- Run Output: The output of the benchmark run. This includes the evaluation results, the passing threshold score, and the passing status.
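As an illustration, the snippet below parses a hypothetical run configuration payload. The keys and values shown are assumptions for the purpose of the example and may differ from the actual JSON produced by the platform.

```python
import json

# Hypothetical example of a run configuration payload; the real field names
# and structure used by ToothFairyAI may differ.
raw = """
{
  "agent": "support-agent",
  "reviewingModel": "TF Sorcerer",
  "evaluationCriteria": {"correctness": 80, "reasoning": 15, "context": 5},
  "passingThreshold": 90
}
"""

config = json.loads(raw)
print(config["reviewingModel"], config["passingThreshold"])
```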
Download benchmark results
The user can download the results of the benchmark run by clicking on the Download button. Both the raw output of each response (in CSV format) and the summary of the results (in the metadata.json file) are available for download.
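If you want to post-process a downloaded run, something like the following could summarise it. The CSV file name and column names are assumptions for illustration only; check the actual export for the exact schema.

```python
import csv
import json

# Assumed file names: the CSV export of per-response results and the
# metadata.json summary mentioned above. Column names are illustrative.
with open("metadata.json") as f:
    summary = json.load(f)

passed = 0
total = 0
with open("benchmark_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        total += 1
        if row.get("passed", "").lower() == "true":
            passed += 1

print(f"Summary: {summary}")
print(f"{passed}/{total} responses met the passing threshold")
```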