VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen
2024-10-11
Abstract: Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite their importance. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB. Unlike previous models such as CLIP and BLIP, VLM2Vec can process any combination of images and text to generate a fixed-dimensional vector based on task instructions. We build a series of VLM2Vec models on Phi-3.5-V and evaluate them on MMEB's evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of a universal multimodal embedding framework, meaning a single embedding model and a shared benchmark that cover many kinds of downstream tasks. The authors identify two main limitations in current multimodal embedding research:

1. **Existing research usually evaluates visual embeddings on isolated tasks**, such as ImageNet classification or MSCOCO/Flickr retrieval. Evaluating this way reveals little about how well a model generalizes across tasks.
2. **Most existing multimodal embedding models**, such as CLIP, BLIP, and SigLIP, either process text and images separately or fuse visual and textual information only shallowly, which limits their ability to capture cross-modal relationships. These models also generalize poorly to zero-shot tasks that require complex reasoning.

To overcome these limitations, the authors make two main contributions:

1. **MMEB (Massive Multimodal Embedding Benchmark)**: a benchmark of 36 datasets (20 for training and 16 for evaluation) covering four meta-tasks: classification, visual question answering, multimodal retrieval, and visual grounding. MMEB provides a common framework for training and evaluating multimodal embedding models and supports various combinations of text and image modalities.
2. **VLM2Vec (Vision-Language Model to Vector)**: a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model. Unlike CLIP and BLIP, VLM2Vec can handle any combination of images and text and generates a fixed-dimensional vector conditioned on task instructions.

Through these contributions, the authors aim to advance multimodal embedding research, particularly in generalization and in capturing cross-modal relationships.

Experimental results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on MMEB, on both in-distribution and out-of-distribution datasets, with the gains holding in zero-shot evaluation.
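
To make the idea of a contrastive training framework more concrete, here is a minimal sketch of an in-batch InfoNCE-style loss, a common objective for contrastive embedding training. The summary above does not specify the loss VLM2Vec uses, so the choice of InfoNCE, the function names, the toy two-dimensional vectors, and the temperature value of 0.05 are all assumptions made for illustration. In a real system the query and target vectors would come from the vision-language model rather than being written by hand.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean in-batch InfoNCE loss (illustrative, not the authors' code).

    targets[i] is treated as the positive for queries[i]; every other
    target in the batch acts as a negative. The temperature is a
    hypothetical value chosen for this example.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i with every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy embeddings standing in for model outputs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its query
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive is far from its query

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each query embedding lies closest to its own target and large when another item in the batch is closer, which is the pressure that pulls matching query-target pairs together during training.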