Evaluating the efficacy of modality conversion in vector databases
- Publication Date: Jan 01, 2024
- Source: DiVA - Academic Archive On-line
- Language: English
- License: Green
Abstract
This thesis explores the integration of generative artificial intelligence (AI) and vector databases to address the challenges of working with multimodal data and vector embeddings, particularly between audio and text. These challenges arise because different types of data, such as audio and text, are often represented in incompatible formats within vector databases, which prevents efficient comparison and retrieval. The thesis examines the efficacy of modality conversion with generative AI as a way to overcome these difficulties. Through tests of audio-to-text and text-to-audio conversion, it evaluates cross-modal retrieval in vector databases.

The tests show that modality conversion, specifically transcription followed by word embeddings, can enable cross-modal retrieval between audio (speech) and written text. Furthermore, the transcription and word embedding method outperforms multimodal embedding models that use a shared embedding space for both text and audio. The drawback is that transcription loses information inherent to speech, such as tone of voice and speaking speed. This approach allows working with both text and audio inputs, using generative AI to convert the data and enable similarity search between audio and text related to that audio or text. It does, however, introduce additional computational cost: storing audio in the database requires modality conversion with generative AI in addition to embedding the generated text, and audio queries against the database must be treated in the same way.

Converting text to audio and attempting similarity search with audio embeddings was deemed largely unsuccessful due to the nature of audio and of most common audio embedding models. Most audio embedding models capture many features unrelated to what the speech represents, such as how loud it is, how fast it is spoken, and who is speaking. For the specific task of using text input to find related audio, this becomes a hindrance, since text input naturally lacks information about tone of voice, dialect, speaking speed, and so on. These features are also not yet within our control, as most text-to-speech models do not support adjusting them. The information in audio that is absent from text, combined with the nature of most audio embedding models, makes this approach generally unsuccessful. Although some results indicated significant performance, the method is not reliable enough to be applied in real use cases.

The results suggest that using generative AI for transcription to work with embeddings of different modalities is successful in the case of text and speech. Further research is needed for other modalities (video, image, audio, text), and solutions must be proposed to deal with information gaps between modalities, since some modalities possess features not inherent in others.
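The abstract describes the successful pipeline only at a high level: transcribe speech with a generative model, embed the transcript, and run similarity search in a vector database. The sketch below illustrates that flow under stated assumptions; the choice of Whisper for transcription, a sentence-transformers model for embeddings, and a plain in-memory cosine-similarity search stand in for whatever tools the thesis actually used, and the file names and query text are hypothetical.

```python
# Sketch of the audio -> text -> embedding -> retrieval flow described above.
# Whisper, sentence-transformers, and the in-memory search are illustrative
# assumptions; the abstract does not name specific models or databases.
import numpy as np
import whisper                                       # openai-whisper, speech-to-text
from sentence_transformers import SentenceTransformer

asr = whisper.load_model("base")                     # generative transcription model
embedder = SentenceTransformer("all-MiniLM-L6-v2")   # text embedding model

def embed_audio(path: str) -> np.ndarray:
    """Convert speech to text, then embed the transcript."""
    transcript = asr.transcribe(path)["text"]
    return embedder.encode(transcript, normalize_embeddings=True)

def embed_text(text: str) -> np.ndarray:
    """Embed a plain-text query into the same space as the transcripts."""
    return embedder.encode(text, normalize_embeddings=True)

# Minimal in-memory stand-in for a vector database: store transcript
# embeddings, then rank them by cosine similarity against a text query.
corpus = ["clip1.wav", "clip2.wav"]                  # hypothetical audio files
vectors = np.stack([embed_audio(p) for p in corpus])

query = embed_text("weather forecast for tomorrow")
scores = vectors @ query                             # cosine similarity (unit vectors)
print("Most similar audio clip:", corpus[int(np.argmax(scores))])
```

As the abstract notes, audio queries would be handled the same way as stored audio: transcribed first, then embedded with the same text model, which is where the extra computational cost of this approach comes from.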