Abstract:Multi-modal retrieval becomes increasingly popular in practice. However, the existing retrievers are mostly text-oriented, which lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing the text-only and image-only data. In this work, we present a new embedding model VISTA for universal multi-modal retrieval. Our work brings forth threefold technical contributions. Firstly, we introduce a flexible architecture which extends a powerful text encoder with the image understanding capability by introducing visual token embeddings. Secondly, we develop two data generation strategies, which bring high-quality composed image-text to facilitate the training of the embedding model. Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performances across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at <a class="link-external link-https" href="https://github.com/FlagOpen/FlagEmbedding" rel="external noopener nofollow">this https URL</a>.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve This paper aims to address several key challenges in multi-modal retrieval. Existing retrieval models are mostly text-centric and lack the ability to handle visual information. Although there are visual-language models like CLIP, these methods still have significant limitations in representing pure text and pure image data. Specifically, the paper focuses on the following points: 1. **Integrated Representation of Text and Images**: - Existing visual-language models (such as CLIP) are far inferior in text representation capabilities compared to the latest pure text embedding models (such as E5 and BGE). - These models mainly focus on independently encoding text and images, with little research on the joint representation of image-text data (e.g., documents with illustrations). 2. **High-Quality Generation of Multi-Modal Data**: - The construction of multi-modal datasets usually requires a large amount of manual annotation, which limits the scale of the datasets and thus affects the training effectiveness of multi-modal embedding models. 3. **Efficient Training of Multi-Modal Embedding Models**: - An effective training strategy is needed to enable the model to pre-train on large-scale weakly labeled data and further enhance multi-modal representation capabilities through the generated high-quality image-text data. ### Solutions To address the above challenges, the authors propose a new embedding model named VISTA. The main contributions of VISTA include: 1. **Flexible Model Architecture**: - By introducing visual token embeddings, the powerful text encoder is extended to a model with image understanding capabilities. - This architecture not only achieves deep integration of text and image data but also retains the strong performance of the text encoder. 2. **Innovative Data Generation Strategy**: - Two data generation pipelines are proposed, one for generating image-text combined data and the other for generating text-to-image-text combined data, ensuring large-scale, high-quality data for training multi-modal embedding models. 3. **Multi-Stage Training Algorithm**: - The first stage involves a basic text-image matching task using a large amount of weakly labeled cross-modal data to align visual token embeddings with the text encoder. - The second stage involves training the joint representation capability using the generated image-text combined data, further enhancing the model's multi-modal embedding capability. ### Experimental Results Through extensive experimental validation, VISTA performs excellently in various multi-modal retrieval tasks under zero-shot and supervised settings, significantly outperforming existing baseline models. Additionally, by conducting multi-modal training on the generated datasets, the zero-shot performance of all baseline models is also significantly improved, demonstrating the effectiveness and generality of the generated datasets.

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Multi-view visual semantic embedding for cross-modal image–text retrieval

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval.

Multimodal learning with only image data: A deep unsupervised model for street view image retrieval by fusing visual and scene text features of images

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval

VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention

A Reconstruction-based Visual-Acoustic-Semantic Embedding Method for Speech-Image Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

Multiview adaptive attention pooling for image-text retrieval

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks