Embeddings settings
The Embeddings settings allow users with the Admin and AI engineer roles to configure the embeddings and chunking behavior for the workspace. Tuning these settings can improve the performance of the AI models and the overall user experience.
Configurations list
Chunk strategy
The strategy used to split the text into smaller pieces (chunks) from which the embeddings are generated. The options are:
Keywords
Summary
When the Keywords option is selected, the text is chunked giving higher relevance to keywords than to the overall context. This is useful when the text contains detailed information and the keywords are its most important part.
When the Summary option is selected, the text is chunked giving higher relevance to the overall context than to the keywords. This is useful when the text contains a lot of information and the overall context is its most important part.
Image extraction strategy
The strategy used to extract images from the source content. The options are:
Safe
Brute force
When the Safe option is selected, the images are extracted using a fast and efficient extraction method. This should be the setting under most circumstances with PDF files and webpages.
When the Brute force option is selected, the images are extracted using ToothFairyAI's proprietary AI image extraction model. This can take longer and might extract more images than the Safe option. However, due to the nature of the model, the extracted images might be cut off or only partially extracted.
Chunk max words
The maximum number of words that can be in a chunk. This is to ensure that the chunks are not too long and can be processed efficiently by the AI models.
The default value is 400.
Chunk sentence overlap
The number of sentences that can overlap between chunks. This is to ensure that the chunks are not too fragmented and can be processed efficiently by the AI models.
The default value is 2.
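As an illustration of how Chunk max words and Chunk sentence overlap interact, here is a minimal Python sketch. It is not ToothFairyAI's implementation: the function names, the naive sentence splitter, and the rule for dropping overlap sentences that do not fit are all assumptions made for this example. Only the default values (400 and 2) come from the settings above.

```python
import re


def word_count(sentences):
    """Total number of whitespace-separated words in a list of sentences."""
    return sum(len(s.split()) for s in sentences)


def chunk_text(text, max_words=400, sentence_overlap=2):
    """Group sentences into chunks of at most max_words words.

    The last sentence_overlap sentences of a chunk are repeated at the
    start of the next one. Hypothetical helper, for illustration only.
    """
    # Naive split: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = []  # sentences in the chunk being built
    fresh = 0     # sentences in `current` that are not carried-over overlap
    for sentence in sentences:
        if fresh > 0 and word_count(current + [sentence]) > max_words:
            chunks.append(" ".join(current))
            # Carry the trailing sentences over as overlap.
            current = current[-sentence_overlap:] if sentence_overlap > 0 else []
            # Drop overlap sentences that would not fit next to the new one.
            while current and word_count(current + [sentence]) > max_words:
                current.pop(0)
            fresh = 0
        current.append(sentence)
        fresh += 1
    if fresh > 0:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_text(sample, max_words=6, sentence_overlap=1):
    print(chunk)
```

With a limit of 6 words and an overlap of 1, each chunk in this sketch starts with the last sentence of the previous chunk. A single sentence longer than the limit still becomes its own oversized chunk, since the sketch never splits inside a sentence.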
Min image width
The minimum width of the image to be extracted. This is to ensure that the images extracted are of a certain quality and can be processed efficiently by the AI models.
The default value is 300.
Min image height
The minimum height of the image to be extracted. This is to ensure that the images extracted are of a certain quality and can be processed efficiently by the AI models.
The default value is 300.
Min aspect ratio
The minimum aspect ratio of the image to be extracted (width/height).
The default value is 0.5.
Max aspect ratio
The maximum aspect ratio of the image to be extracted (width/height).
The default value is 6.
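The four image thresholds can be read together as a single filter. The Python sketch below shows one plausible way they combine. The function name is hypothetical, the dimensions are assumed to be in pixels, and the limits are assumed to be inclusive; none of this is confirmed by the settings page, which only provides the default values.

```python
def passes_image_filters(
    width,
    height,
    min_width=300,
    min_height=300,
    min_aspect_ratio=0.5,
    max_aspect_ratio=6.0,
):
    """Return True if an image of the given size would be kept.

    Hypothetical helper: assumes pixel dimensions and inclusive limits.
    """
    if width <= 0 or height <= 0:
        return False  # guard against invalid sizes and division by zero
    if width < min_width or height < min_height:
        return False
    aspect_ratio = width / height  # width/height, as defined above
    return min_aspect_ratio <= aspect_ratio <= max_aspect_ratio


print(passes_image_filters(800, 600))   # large enough, ratio about 1.33
print(passes_image_filters(200, 600))   # narrower than the minimum width
print(passes_image_filters(2400, 300))  # ratio 8 exceeds the maximum of 6
```

Under these assumptions, raising the minimum width or height discards small icons and thumbnails, while the aspect ratio limits discard very tall or very wide images such as separators and banners.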
Changing the Embeddings settings can have a significant impact on the performance of the AI models in terms of data retrieval and image extraction. It is recommended to test the settings with a small dataset before applying them to the entire workspace.