Training data
ToothFairyAI allows you to manage the datasets needed to train your AI models. This document provides an overview of the training data creation and import process.
Menu location
Training data can be accessed from the following menu:
Training data > Create new doc
The fine-tuning section consists of the following tabs:
- Generative AI: This tab allows the user to store datasets specific to the Content creator agents
- Conversational AI: This tab allows the user to store datasets specific to the Chat agents
- Sentiment reviewer: This tab allows the user to store datasets specific to the Sentiment reviewer agents
Creation of training data
The creation of training data ultimately comes down to the creation of a Live document with a specific structure.
As with the Knowledge Hub live documents, users can collaborate in real time to create the training data needed to fine-tune the AI models.
The main advantage of this approach is that the training data is a blend of real documents and structured embedded snippets that are aware of the content surrounding them.
The behaviour of the Live document is very similar to the Knowledge Hub live documents, with the following differences:
Generative AI and Conversational AI
- Topics: The topics are used to filter the training data. The user can select the topics that are relevant to the training data
- Validation: It sets whether the document and the training data are used for fine-tuning the AI model or solely for validation purposes
- Add snippet: Under the editor toolbar, Generative AI shows a button labelled ADD GTS, while Conversational AI shows ADD TURN. Clicking on the button adds a new snippet to the document (more detail in the next section)
Generative AI training snippet
The generative AI snippet consists of the following fields:
- System prompt: The system instruction that the AI model will use to generate the response in a non-multi-turn conversation
- Completion: What the AI model should generate as a response to the system prompt
- Context: The context is used as the direct prompt for the AI model to generate the response. It can be inspected by clicking on the Context info label. If the context is not provided within the document, or is not relevant, the user can choose to force the context using the Force switch at the top of the snippet. When the context is forced, a new input field appears where the user can provide the context
The context and the completion must be more than 128 characters and no longer than 1200 characters.
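The length rule above can be checked before text is pasted into a snippet. The following is a minimal sketch, assuming the limits count plain characters the way Python's len does (how the platform counts characters is not stated here). The function names are hypothetical and are not part of ToothFairyAI.

```python
MIN_CHARS = 128   # exclusive lower bound: "more than 128 characters"
MAX_CHARS = 1200  # inclusive upper bound: "no longer than 1200 characters"


def length_ok(text: str) -> bool:
    """True when the text is longer than 128 and at most 1200 characters."""
    return MIN_CHARS < len(text) <= MAX_CHARS


def check_snippet(context: str, completion: str) -> list:
    """Return one message per field that breaks the length rule (hypothetical helper)."""
    problems = []
    for name, value in (("context", context), ("completion", completion)):
        if not length_ok(value):
            problems.append(
                f"{name} has {len(value)} characters; expected 129 to 1200"
            )
    return problems
```

An empty list from check_snippet means both fields fall within the stated range.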
The user can delete the snippet by clicking on the Delete icon on the top right-hand side of the snippet
Conversational AI training snippet
The conversational AI snippet consists of the following fields:
- Role: The selection of the role of the AI model in the conversation. The user can select from the following roles:
  - User: The AI model will act as the user in the conversation
  - Assistant: The AI model will act as the AI in the conversation
- Message: The message of the user or the assistant in the conversation. It is highly recommended to alternate strictly between the user and the assistant messages and finish the conversation with the assistant message.
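The alternation recommendation above can be checked on a draft conversation before the turns are added. This is a minimal sketch, assuming the roles are represented as the lowercase strings "user" and "assistant" (an assumption made for illustration; the platform's internal labels are not documented here). The function name is hypothetical.

```python
def check_turns(roles: list) -> list:
    """Return problems with a sequence of turn roles (hypothetical helper).

    Checks the two recommendations: strict alternation between roles,
    and a final message from the assistant.
    """
    problems = []
    for i in range(1, len(roles)):
        if roles[i] == roles[i - 1]:
            # Report 1-based turn numbers for readability.
            problems.append(f"turns {i} and {i + 1} are both '{roles[i]}'")
    if roles and roles[-1] != "assistant":
        problems.append("conversation does not end with an assistant message")
    return problems
```

For example, check_turns(["user", "assistant", "user", "assistant"]) returns an empty list, while a conversation that ends on a user turn is flagged.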
The editor content outside the snippets is not used for the chat training data; however it can be used to provide additional information to the user reviewing the chat training data.
The user can delete the snippet by clicking on the Delete icon on the top right-hand side of the snippet
Sentiment reviewer
- Validation: It sets whether the document and the training data are used for fine-tuning the AI model or solely for validation purposes
- NER toolbar: It allows the user to highlight the entities in the document. The entities are used to train the sentiment reviewer model
- Intent selection: It allows the user to select the intent of each paragraph within the document, using the head icon on the right-hand side of the editor line.
Both NER and Intents are fully customisable and can be configured in the Settings/NERs and Settings/Intents sections respectively.
Import of training data
ToothFairyAI allows you to import data in the form of JSONL files that natively support the structure required by the AI models and blend with the data created in the platform through the Live documents.
Generative AI and Conversational AI training data
The import of the training data is done by uploading a JSONL file. The JSONL file must be formatted according to the structure required by the AI model. ToothFairyAI will not validate the content of the JSONL file, so it is important to ensure that the JSONL file is correctly formatted.
Each line of the JSONL file must contain the following fields:
- sys: The system prompt that the AI model will use to generate the response in a non-multi-turn conversation (only applicable for Generative AI)
- prompt: The prompt that the AI model will receive from the users
- completion: The completion that the AI model should generate as a response to the prompt
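Because the platform does not validate uploaded files, it can help to build and check the JSONL locally first. The following is a minimal sketch using only Python's standard json module. The field names sys, prompt and completion come from the list above. The helper names, the sample record, and the assumption that Conversational AI lines need only prompt and completion are illustrative and not confirmed by ToothFairyAI.

```python
import json

# Field names come from the list above; "sys" applies to Generative AI only.
GENERATIVE_FIELDS = ("sys", "prompt", "completion")
CONVERSATIONAL_FIELDS = ("prompt", "completion")  # assumption: "sys" omitted


def to_jsonl(records):
    """Serialise each record as one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def check_jsonl(text, required_fields):
    """Return a list of problems found; an empty list means every line passed."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines such as the one after the last newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [field for field in required_fields if field not in record]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
    return problems


# Hypothetical sample record, for illustration only.
sample = [{
    "sys": "You write short product descriptions.",
    "prompt": "Describe an electric toothbrush.",
    "completion": "A rechargeable toothbrush with a two-minute timer.",
}]
jsonl_text = to_jsonl(sample)
print(check_jsonl(jsonl_text, GENERATIVE_FIELDS))
```

The check only covers JSON syntax and the presence of the listed fields; it does not confirm that the platform will accept the content.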