Training data

ToothFairyAI allows you to manage the datasets needed to train your AI models. This document provides an overview of the training data creation and import process.

Training data can be accessed from the following menu:

Training data > Create new doc

The fine-tuning section consists of the following tabs:

  • Generative AI: This tab allows the user to store datasets specific to the Content creator agents
  • Conversational AI: This tab allows the user to store datasets specific to the Chat agents
  • Sentiment reviewer: This tab allows the user to store datasets specific to the Sentiment reviewer agents

Creation of training data

Creating training data ultimately comes down to creating a Live document with a specific structure. As with the Knowledge Hub live documents, users can collaborate in real time to create the training data needed to fine-tune the AI models. The advantage of this approach is that the training data blends real documents with structured, embedded snippets that are aware of the content surrounding them.

The behaviour of the Live document is very similar to the Knowledge Hub live documents, with the following differences:

Generative AI and Conversational AI

  • Topics: The topics are used to filter the training data. The user can select the topics that are relevant to the training data
  • Validation: It sets whether the document and its training data are used for fine-tuning the AI model or solely for validation purposes
  • Add snippet: Under the editor toolbar, Generative AI shows a button labelled ADD GTS, while Conversational AI shows a button labelled ADD TURN. Clicking the button adds a new snippet to the document (more detail in the next section)

Generative AI training snippet

The generative AI snippet consists of the following fields:

  • System prompt: The system instruction that the AI model will use to generate the response in a single-turn (non-multi-turn) conversation
  • Completion: What the AI model should generate as a response to the system prompt
  • Context: The context is used as the direct prompt from which the AI model generates the response; it can be inspected by clicking on the Context info label. If the document does not provide the context, or the context it provides is not relevant, the user can force a custom context with the Force switch at the top of the snippet. When the context is forced, a new input field appears where the user can provide the context

The context and the completion must each be more than 128 characters and no longer than 1200 characters. The user can delete the snippet by clicking on the Delete icon at the top right-hand side of the snippet.
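The length limits above can be checked before text is pasted into a snippet. The following is a minimal sketch, not part of ToothFairyAI: the function names are hypothetical, and it assumes that "characters" corresponds to Python's `len()` on a string, which may differ from how the platform counts.

```python
MIN_CHARS = 128   # exclusive lower bound: "more than 128 characters"
MAX_CHARS = 1200  # inclusive upper bound: "no longer than 1200 characters"


def length_ok(text: str) -> bool:
    """Return True if text is longer than 128 and at most 1200 characters."""
    return MIN_CHARS < len(text) <= MAX_CHARS


def check_snippet(context: str, completion: str) -> list[str]:
    """Return a list of human-readable problems; empty if both fields fit."""
    problems = []
    for name, value in (("context", context), ("completion", completion)):
        if not length_ok(value):
            problems.append(
                f"{name} has {len(value)} characters; expected 129 to 1200"
            )
    return problems
```

An empty list from `check_snippet` means both fields fall within the documented range.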

Conversational AI training snippet

The conversational AI snippet consists of the following fields:

  • Role: The selection of the role of the AI model in the conversation. The user can select from the following roles:
    • User: The AI model will act as the user in the conversation
    • Assistant: The AI model will act as the AI in the conversation
  • Message: The message of the user or the assistant in the conversation. It is highly recommended to alternate strictly between user and assistant messages and to finish the conversation with an assistant message.

The editor content outside the snippets is not used for the chat training data; however, it can be used to provide additional information to the user reviewing the chat training data. The user can delete the snippet by clicking on the Delete icon at the top right-hand side of the snippet.
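The recommendation to alternate roles and end on an assistant message can be expressed as a small check. This is an illustrative sketch only: the `{"role": ..., "message": ...}` dictionaries and the lowercase role names are assumptions made for the example, not a format defined by ToothFairyAI.

```python
def check_turns(turns: list[dict]) -> list[str]:
    """Return a list of problems with the ordering of conversation turns."""
    problems = []
    # Each turn must carry one of the two roles described above.
    for position, turn in enumerate(turns, start=1):
        if turn["role"] not in ("user", "assistant"):
            problems.append(f"turn {position} has unknown role {turn['role']!r}")
    # Consecutive turns should not share the same role.
    for position in range(1, len(turns)):
        if turns[position]["role"] == turns[position - 1]["role"]:
            problems.append(
                f"turns {position} and {position + 1} share the role "
                f"{turns[position]['role']!r}"
            )
    # The conversation should finish with an assistant message.
    if turns and turns[-1]["role"] != "assistant":
        problems.append("conversation does not end with an assistant message")
    return problems
```

A conversation that strictly alternates and ends with the assistant returns an empty list.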

Sentiment reviewer

  • Validation: It sets whether the document and its training data are used for fine-tuning the AI model or solely for validation purposes
  • NER toolbar: It allows the user to highlight the entities in the document. The entities are used to train the sentiment reviewer model
  • Intent selection: It allows the user to select the intent of each paragraph within the document, using the head icon on the right-hand side of the editor line

Tip: Both NER and Intents are fully customisable and can be configured in the Settings/NERs and Settings/Intents sections respectively.

Import of training data

ToothFairyAI allows you to import data as JSONL files, which natively support the structure required by the AI models and blend with the data created in the platform through the Live documents.

Generative AI and Conversational AI training data

The import of the training data is done by uploading a JSONL file. The JSONL file must be formatted according to the structure required by the AI model. ToothFairyAI will not validate the content of the JSONL file, so it is important to ensure that the JSONL file is correctly formatted.

Each line of the JSONL file must contain the following fields:

  • sys: The system prompt that the AI model will use to generate the response in a single-turn (non-multi-turn) conversation (only applicable for Generative AI)
  • prompt: The prompt that the AI model will receive from the users
  • completion: The completion that the AI model should generate as a response to the prompt
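Because ToothFairyAI does not validate the uploaded file, it may help to build the JSONL locally and check the field names first. The sketch below is an assumption-laden example, not an official tool: the helper name `to_jsonl` and the `generative` flag are hypothetical, and only the field names `sys`, `prompt` and `completion` come from the list above.

```python
import json

BASE_FIELDS = ("prompt", "completion")


def to_jsonl(records: list[dict], generative: bool = True) -> str:
    """Serialise records to JSONL text, one JSON object per line.

    When generative is True, "sys" is required as well, since the system
    prompt only applies to Generative AI data.
    """
    required = ("sys",) + BASE_FIELDS if generative else BASE_FIELDS
    lines = []
    for number, record in enumerate(records, start=1):
        missing = [field for field in required if field not in record]
        if missing:
            raise ValueError(f"record {number} is missing: {', '.join(missing)}")
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


# Placeholder values for illustration only.
example = to_jsonl(
    [{"sys": "You write product blurbs.", "prompt": "Describe the kettle.", "completion": "A fast, quiet kettle."}]
)
```

The returned string can be written to a `.jsonl` file with UTF-8 encoding before upload.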