Benchmarks
Benchmarking is the process of evaluating the performance of a system or component against a set of predefined metrics. In ToothFairy AI, benchmarks store predefined sets of questions, answers, reasoning snippets, and context used to rate and verify the quality of generated responses. Unlike other platforms, ToothFairy AI benchmarks are agentic in nature: rather than assessing the quality of a model's responses, they evaluate the quality of the agent's responses. This includes scenarios where the agent must generate a response from a given set of documents (RAG) or use a series of tools to accomplish a task.
Create a benchmark
- Click on the Create button.
- Assign a name and a description (optional) to the benchmark for easy identification.
- Upload the benchmark CSV file. The file should contain the following columns:
| question | answer | reasoning | context |
|---|---|---|---|
| Question 1 | Answer 1 | Reasoning 1 | Context 1 |
| Question 2 | Answer 2 | Reasoning 2 | Context 2 |
| Question 3 | Answer 3 | Reasoning 3 | Context 3 |
For best compatibility, it is recommended to use CSV files that qualify text fields with double quotes.
For example:
"question","answer","reasoning","context"
"What is the capital of France?","Paris","The capital of France is Paris.","Let's talk about geography."
"What is the capital of Italy?","Rome","The capital of Italy is Rome.","Let's talk about best countries in the world."
Upon uploading the file, the system automatically parses it and displays its contents, specifically the first 10 rows, in the preview section. If the file is not correctly formatted, an error message is displayed.
- Click on the Save button to create the benchmark.
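The recommended double-quote qualification can be produced with Python's standard `csv` module before uploading. The sketch below is illustrative only; the `write_benchmark_csv` helper is a hypothetical name, not part of ToothFairy AI:

```python
import csv
import io

# Hypothetical helper: serialise benchmark rows with every field wrapped
# in double quotes, matching the recommended text-qualification format.
def write_benchmark_csv(rows):
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerow(["question", "answer", "reasoning", "context"])
    for row in rows:
        writer.writerow(
            [row["question"], row["answer"], row["reasoning"], row["context"]]
        )
    return buf.getvalue()

csv_text = write_benchmark_csv([
    {
        "question": "What is the capital of France?",
        "answer": "Paris",
        "reasoning": "The capital of France is Paris.",
        "context": "Let's talk about geography.",
    }
])
```

Using `csv.QUOTE_ALL` also keeps commas or line breaks inside a field from breaking the row structure.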
Benchmarks are available only for Pro and Enterprise plans. Please contact us for more information to start creating benchmarks.
Benchmark details
- Question: The question that the agent is required to answer. This field mimics the user input.
- Answer: The expected answer that the agent should generate. This field is used to evaluate the quality of the response.
- Reasoning: The reasoning behind the answer. This field is used to evaluate the quality of the response.
- Context: The context in which the question is asked. This field mimics the context in which the user is asking the question, for example the summary of previous turns in a conversation.
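A file can be sanity-checked locally before upload by verifying the four column headers and previewing the first rows, similar to what the preview section does. This is a minimal sketch, assuming Python's standard `csv` module; the `preview_benchmark` function is hypothetical:

```python
import csv
import io

REQUIRED_COLUMNS = ["question", "answer", "reasoning", "context"]

# Hypothetical pre-upload check: confirm the header contains the four
# required columns and return up to the first 10 data rows.
def preview_benchmark(csv_text, limit=10):
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Benchmark CSV is missing columns: {missing}")
    return [row for _, row in zip(range(limit), reader)]

sample = (
    '"question","answer","reasoning","context"\n'
    '"What is the capital of France?","Paris",'
    '"The capital of France is Paris.","Let\'s talk about geography."\n'
)
rows = preview_benchmark(sample)
```

Raising on missing columns mirrors the error message the platform shows for an incorrectly formatted file.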
At this stage, benchmarks do not support the planner and desktop agents. Benchmarks are only available for the coder, retriever, and casual agents.