Embeddings settings

The Embeddings settings let users with the Admin or AI engineer role configure how the workspace chunks text and generates embeddings. Tuning these settings can improve the performance of the AI models and the overall user experience.

Configurations list

Chunk strategy

The strategy used to split the text into smaller chunks before the embeddings are generated. The options are:

Keywords
Summary

When the Keywords option is selected, the text is chunked with more weight given to keywords than to the overall context. This is useful when the text contains detailed information and the keywords are its most important part.

When the Summary option is selected, the text is chunked with more weight given to the overall context than to the keywords. This is useful when the text contains a lot of information and the overall context is its most important part.

Image extraction strategy

The strategy used to extract the images from the text. The options are:

Safe
Brute force

When the Safe option is selected, the images are extracted using a fast, efficient extraction method. This should be the setting under most circumstances with PDF files and webpages.

When the Brute force option is selected, the images are extracted using ToothFairyAI's proprietary AI image extraction model. This can take longer and might extract more images than the Safe option. However, due to the nature of the model, the extracted images might be cut off or only partially extracted.

Chunk max words

The maximum number of words that can be in a chunk. This is to ensure that the chunks are not too long and can be processed efficiently by the AI models. The default value is 400.

Chunk sentence overlap

The number of sentences that can overlap between chunks. This is to ensure that the chunks are not too fragmented and can be processed efficiently by the AI models. The default value is 2.
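To make the interaction between Chunk max words and Chunk sentence overlap concrete, here is a minimal Python sketch. It is an illustration only, not ToothFairyAI's actual chunker: the function names `split_sentences` and `chunk_text`, the naive punctuation-based sentence splitter, and the rule for trimming the overlap are all assumptions made for this example.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_count(sentences):
    return sum(len(s.split()) for s in sentences)


def chunk_text(text, max_words=400, sentence_overlap=2):
    """Group sentences into chunks of at most max_words words, repeating the
    last sentence_overlap sentences of each chunk at the start of the next.
    A single sentence longer than max_words still becomes its own chunk."""
    chunks = []
    current = []  # sentences in the chunk being built
    fresh = 0     # sentences in `current` not yet emitted in an earlier chunk
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if fresh > 0 and word_count(current) + n > max_words:
            chunks.append(" ".join(current))
            # Carry the trailing sentences over as overlap.
            carry = current[-sentence_overlap:] if sentence_overlap > 0 else []
            # Trim the overlap if it would push the next chunk over the limit.
            while carry and word_count(carry) + n > max_words:
                carry.pop(0)
            current = carry
            fresh = 0
        current.append(sentence)
        fresh += 1
    if fresh > 0:
        chunks.append(" ".join(current))
    return chunks


sample = "A b c. D e f. G h i. J k l."
print(chunk_text(sample, max_words=6, sentence_overlap=1))
# ['A b c. D e f.', 'D e f. G h i.', 'G h i. J k l.']
```

In this sketch, a larger overlap repeats more text across chunks, so the same input produces more chunks; a smaller overlap produces fewer chunks with less shared context between neighbours.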

Min image width

The minimum width of the image to be extracted. This is to ensure that the images extracted are of a certain quality and can be processed efficiently by the AI models. The default value is 300.

Min image height

The minimum height of the image to be extracted. This is to ensure that the images extracted are of a certain quality and can be processed efficiently by the AI models. The default value is 300.

Min aspect ratio

The minimum aspect ratio of the image to be extracted (width/height). The default value is 0.5.

Max aspect ratio

The maximum aspect ratio of the image to be extracted (width/height). The default value is 6.
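The four image thresholds can be read together as a single filter. The Python sketch below shows one plausible interpretation with the default values listed above; it is not the product's implementation. The function name `passes_image_filters` is hypothetical, and it assumes that dimensions are measured in pixels and that all bounds are inclusive.

```python
def passes_image_filters(
    width,
    height,
    min_width=300,
    min_height=300,
    min_aspect_ratio=0.5,
    max_aspect_ratio=6.0,
):
    """Return True if an image of the given size would be kept.
    Assumptions: sizes are in pixels and every bound is inclusive."""
    if width <= 0 or height <= 0:
        return False
    if width < min_width or height < min_height:
        return False
    aspect_ratio = width / height  # width/height, as defined in the settings
    return min_aspect_ratio <= aspect_ratio <= max_aspect_ratio


print(passes_image_filters(800, 600))   # True: large enough, ratio about 1.33
print(passes_image_filters(200, 600))   # False: narrower than Min image width
print(passes_image_filters(2400, 300))  # False: ratio 8 exceeds Max aspect ratio
print(passes_image_filters(300, 700))   # False: ratio about 0.43 is below Min aspect ratio
```

Under these assumptions, very wide banners and very tall strips are rejected by the aspect ratio bounds even when they satisfy both minimum dimensions.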

Tip

Changing the Embeddings settings can have a significant impact on how well the AI models retrieve data and extract images. It is recommended to test the settings with a small dataset before applying them to the entire workspace.
