VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
2024-06-07
Abstract: Multi-modal retrieval becomes increasingly popular in practice. However, the existing retrievers are mostly text-oriented, which lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing the text-only and image-only data. In this work, we present a new embedding model VISTA for universal multi-modal retrieval. Our work brings forth threefold technical contributions. Firstly, we introduce a flexible architecture which extends a powerful text encoder with the image understanding capability by introducing visual token embeddings. Secondly, we develop two data generation strategies, which bring high-quality composed image-text to facilitate the training of the embedding model. Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performances across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding.
Information Retrieval, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses several key challenges in multi-modal retrieval. Existing retrieval models are mostly text-centric and cannot handle visual information. Vision-language models such as CLIP exist, but they remain significantly limited in representing text-only and image-only data. Specifically, the paper focuses on the following points:

1. **Integrated Representation of Text and Images**:
   - Existing vision-language models (such as CLIP) have far weaker text representation capabilities than recent text-only embedding models (such as E5 and BGE).
   - These models mainly encode text and images independently, and there is little research on the joint representation of image-text data (e.g., documents with illustrations).
2. **High-Quality Generation of Multi-Modal Data**:
   - Constructing multi-modal datasets usually requires a large amount of manual annotation, which limits dataset scale and in turn the training effectiveness of multi-modal embedding models.
3. **Efficient Training of Multi-Modal Embedding Models**:
   - An effective training strategy is needed so that the model can pre-train on large-scale weakly labeled data and then strengthen its multi-modal representation capability using the generated high-quality image-text data.

### Solutions

To address these challenges, the authors propose a new embedding model named VISTA. Its main contributions are:

1. **Flexible Model Architecture**:
   - Visual token embeddings are introduced to extend a powerful text encoder with image understanding capability.
   - This architecture achieves deep integration of text and image data while retaining the strong performance of the text encoder.
2. **Innovative Data Generation Strategy**:
   - Two data generation pipelines are proposed, one for generating image-text combined data and the other for generating text-to-image-text combined data, providing large-scale, high-quality data for training multi-modal embedding models.
3. **Multi-Stage Training Algorithm**:
   - The first stage uses a basic text-image matching task on a large amount of weakly labeled cross-modal data to align the visual token embeddings with the text encoder.
   - The second stage trains the joint representation capability on the generated image-text combined data, further improving the model's multi-modal embedding capability.

### Experimental Results

In extensive experiments, VISTA performs strongly across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings, significantly outperforming existing baseline models. In addition, multi-modal training on the generated datasets significantly improves the zero-shot performance of all baseline models, which demonstrates the effectiveness and generality of the generated datasets.
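
The architecture and training idea described under Solutions can be illustrated with a small toy sketch. It is not the authors' implementation: the linear patch projection, the mean-pooling stand-in for the text encoder, the in-batch contrastive (InfoNCE-style) loss, the temperature value, and all function names (`project_patches`, `encode`, `embed`, `info_nce`) are assumptions made for illustration, since the summary only states that visual token embeddings are fed to a text encoder and that training uses a text-image matching task.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def project_patches(patches, weight):
    """Map image patch features to visual tokens with a linear layer.

    `weight` has shape (token_dim x patch_dim); this projection is an
    assumed stand-in for whatever image tokenizer the real model uses.
    """
    return [
        [sum(w * x for w, x in zip(row, patch)) for row in weight]
        for patch in patches
    ]


def encode(tokens):
    """Hypothetical stand-in for the text encoder: mean-pool, then normalize."""
    dim = len(tokens[0])
    pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    return l2_normalize(pooled)


def embed(text_tokens, patches=None, weight=None):
    """Embed text-only, image-only, or composed image-text input.

    Visual tokens are placed in front of the text tokens so that one
    encoder handles every input type. At least one input must be non-empty.
    """
    visual = project_patches(patches, weight) if patches else []
    return encode(visual + text_tokens)


def info_nce(queries, candidates, temperature=0.05):
    """In-batch contrastive loss: queries[i] should match candidates[i]."""
    losses = []
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, cand)) / temperature
            for cand in candidates
        ]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

Under these assumptions, the first training stage would call `info_nce` with image-only embeddings as queries and their caption embeddings as candidates, updating only the projection weights. The second stage would reuse the same loss with composed image-text inputs passed through `embed`.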