Training data
ToothFairyAI allows you to manage the datasets needed to train your AI models. This document provides an overview of the training data creation and import process.
Menu location
Training data can be accessed from the following menu:
Training data > Create new doc
The fine-tuning section consists of the following tabs:
- Conversational AI: This tab allows the user to store datasets specific to the agents
Creation of training data
The creation of training data ultimately comes down to the creation of a Live document with a specific structure.
As with the Knowledge Hub live documents, users can collaborate in real time to create the training data needed to fine-tune the AI models.
The key benefit of this approach is that the training data is a blend of real documents and structured embedded snippets that are aware of the content surrounding them.
The behaviour of the Live document is very similar to the Knowledge Hub live documents, with the following differences:
Conversational AI
- Topics: The topics are used to filter the training data. The user can select the topics that are relevant to the training data
- Validation: It sets whether the document and its training data are used for fine-tuning the AI model or solely for validation purposes
- Add snippet: Conversational AI will show ADD SYS and ADD TURN.
The ADD SYS button allows the user to inject a system message into the document as an initial block right under the title of the document. The system block is entirely optional; however, we recommend including it if the downstream model you want to generate via fine-tuning requires strong agentic behaviour.
The ADD TURN button allows the user to add a new training snippet. More details below.
Conversational AI training snippet
The conversational AI snippet consists of the following fields:
- Role: The selection of the role of the AI model in the conversation. The user can select from the following roles:
  - User: The AI model will act as the user in the conversation
  - Assistant: The AI model will act as the AI in the conversation
- Message: The message of the user or the assistant in the conversation. It is highly recommended to alternate strictly between user and assistant messages and to finish the conversation with an assistant message.
The editor content outside the snippets is not used for the chat training data; however it can be used to provide additional information to the user reviewing the chat training data.
The user can delete the snippet by clicking on the Delete icon on the top right-hand side of the snippet.
ToothFairyAI enforces the alternation between user and assistant roles at the time of the creation of the training data; this manifests as a semi-empty message with the appropriate role inserted between two consecutive snippets with the same role. We highly recommend avoiding these edge cases, as they can significantly impact the quality of the dataset down the line.
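The alternation rule above can also be checked on your side before a conversation is entered or imported. The following is a minimal Python sketch and not part of ToothFairyAI: the function name `check_alternation` is hypothetical, and the message layout (dictionaries with `role` and `content` keys) is assumed to match the JSONL import format.

```python
def check_alternation(messages):
    """Return a list of problems found in one conversation (empty if none).

    Hypothetical helper: `messages` is assumed to be a list of dicts with
    "role" and "content" keys, mirroring the JSONL import format.
    """
    # An optional system message is only expected as the very first block.
    has_system = bool(messages) and messages[0].get("role") == "system"
    turns = messages[1:] if has_system else messages
    if not turns:
        return ["conversation has no user or assistant turns"]

    problems = []
    previous = None
    for index, message in enumerate(turns):
        role = message.get("role")
        if role not in ("user", "assistant"):
            problems.append(f"turn {index}: unexpected role {role!r}")
        elif role == previous:
            problems.append(f"turn {index}: two consecutive {role!r} messages")
        previous = role

    # The recommendation is to finish with an assistant message.
    if turns[-1].get("role") != "assistant":
        problems.append("conversation does not end with an assistant message")
    return problems
```

An empty result means the conversation alternates strictly and ends with the assistant; any returned entry points at a turn worth fixing before it reaches the dataset.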
Import of training data
ToothFairyAI allows you to import data in the form of JSONL files that natively support the structure required by the AI models and blend with the data created in the platform through the Live documents.
Conversational AI training data
The import of the training data is done by uploading a JSONL file. The JSONL file must be formatted according to the structure required by the AI model. ToothFairyAI will not validate the content of the JSONL file, so it is important to ensure that the JSONL file is correctly formatted.
Each line of the JSONL file must strictly contain a JSON object with the following structure:
{
  "messages": [
    {
      "role": "system",
      "content": "System message"
    },
    {
      "role": "user",
      "content": "User message"
    },
    {
      "role": "assistant",
      "content": "Assistant message"
    },
    ...
  ]
}
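The structure above is shown expanded for readability, whereas in the file each object has to sit on a single line. As a minimal sketch (the helper name `write_jsonl` and the file path are illustrative assumptions, not part of ToothFairyAI), a JSONL file of this shape could be produced in Python with the standard `json` module:

```python
import json


def write_jsonl(conversations, path):
    """Write each conversation as one compact JSON object per line.

    Hypothetical helper: `conversations` is a list of message lists, each
    message being a dict with "role" and "content" keys.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for messages in conversations:
            # json.dumps produces a single line; newlines inside a message
            # are escaped, so one record never spans several lines.
            record = json.dumps({"messages": messages}, ensure_ascii=False)
            handle.write(record + "\n")


if __name__ == "__main__":
    example = [
        [
            {"role": "system", "content": "System message"},
            {"role": "user", "content": "User message"},
            {"role": "assistant", "content": "Assistant message"},
        ]
    ]
    write_jsonl(example, "training_data.jsonl")
```

Serialising through `json.dumps` rather than concatenating strings by hand keeps quoting and escaping correct, which matters because a malformed line can make the whole upload unusable.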
ToothFairyAI performs a shallow validation of the data structure when the JSONL file is uploaded; however, if errors arise later, the platform will not be able to provide a detailed report of the issue with the dataset during the fine-tuning process. In most cases you will be able to see the details of the error for any given Fine-tuning job in status InError.
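Because detailed error reports are limited, it can help to check a file locally before uploading it. The following Python sketch is an assumption about what a useful pre-upload check might look like, not a description of the platform's own validation; the names `find_jsonl_errors` and `ALLOWED_ROLES` are hypothetical.

```python
import json

# Roles that appear in the documented structure.
ALLOWED_ROLES = ("system", "user", "assistant")


def find_jsonl_errors(lines):
    """Yield (line_number, description) for each structural problem found."""
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            yield number, f"invalid JSON: {error.msg}"
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            yield number, 'missing or empty "messages" list'
            continue
        for position, message in enumerate(messages):
            if not isinstance(message, dict):
                yield number, f"message {position} is not an object"
            elif message.get("role") not in ALLOWED_ROLES:
                yield number, f"message {position} has an unexpected role"
            elif not isinstance(message.get("content"), str):
                yield number, f"message {position} has non-string content"
```

A typical use is to open the file, pass the file object to `find_jsonl_errors`, and print each reported line number, so that problems are fixed before a fine-tuning job is started.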