Abstract: Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite their importance. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e., classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB. Unlike previous models such as CLIP and BLIP, VLM2Vec can process any combination of images and text to generate a fixed-dimensional vector based on task instructions. We build a series of VLM2Vec models on Phi-3.5-V and evaluate them on MMEB's evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB.

What problem does this paper attempt to address?

This paper sets out to build a general-purpose multimodal embedding framework that can support future research. The authors identify two main limitations in current multimodal embedding work:

1. **Existing research usually evaluates visual embeddings on isolated tasks**, such as ImageNet classification or MSCOCO/Flickr retrieval. This practice limits what is known about a model's ability to generalize across tasks.
2. **Most existing multimodal embedding models**, such as CLIP, BLIP, and SigLIP, either process text and images separately or fuse visual and textual information only shallowly, which limits their ability to capture cross-modal relationships. These models also generalize poorly to zero-shot tasks that require complex reasoning.

To overcome these limitations, the authors make two main contributions:

1. **MMEB (Massive Multimodal Embedding Benchmark)**: a comprehensive benchmark of 36 datasets (20 for training and 16 for evaluation) covering four meta-tasks: classification, visual question answering, retrieval, and visual grounding. MMEB provides a framework for both training and evaluating multimodal embedding models and supports various combinations of text and image modalities.
2. **VLM2Vec (Vision-Language Model to Vector)**: a contrastive training framework that can convert any state-of-the-art vision-language model into an embedding model. Unlike CLIP and BLIP, VLM2Vec can handle any combination of images and text and generates a fixed-dimensional vector conditioned on task instructions.

Through these contributions, the authors hope to advance research on multimodal embeddings, particularly with respect to generalization and the capture of cross-modal relationships.

Experimental results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on MMEB's evaluation split, on both in-distribution and out-of-distribution (zero-shot) datasets.
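The summary describes VLM2Vec only as a "contrastive training framework" and does not spell out the objective. As a rough illustration of what such training typically optimizes, the sketch below computes an InfoNCE-style loss with in-batch negatives over paired query and target embeddings. The choice of InfoNCE, the function names, the temperature value, and the random toy embeddings are all assumptions made for illustration and are not taken from the paper; in practice the embeddings would come from the vision-language model rather than a random generator.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce_loss(query_emb, target_emb, temperature=0.05):
    """InfoNCE loss with in-batch negatives (hypothetical helper).

    Row i of query_emb is paired with row i of target_emb; every other
    row of target_emb in the batch acts as a negative for query i.
    The temperature default is an illustrative assumption.
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(target_emb, dtype=np.float64))
    logits = (q @ t.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for query i is target i.
    return float(-np.mean(np.diag(log_probs)))


# Toy stand-ins for model outputs: 4 pairs of 8-dimensional embeddings.
rng = np.random.default_rng(0)
targets = rng.normal(size=(4, 8))
aligned_queries = targets + 0.01 * rng.normal(size=(4, 8))  # correctly paired
shuffled_queries = aligned_queries[[1, 2, 3, 0]]            # pairs broken

aligned_loss = info_nce_loss(aligned_queries, targets)
shuffled_loss = info_nce_loss(shuffled_queries, targets)
print(f"aligned pairs:  {aligned_loss:.4f}")
print(f"shuffled pairs: {shuffled_loss:.4f}")
```

The loss is low when each query embedding lies closest to its own target and high when the pairing is broken, which is the signal that pulls matching query-target pairs together during contrastive training.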

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

E5-V: Universal Embeddings with Multimodal Large Language Models

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Multi-view visual semantic embedding for cross-modal image–text retrieval

EVLM: An Efficient Vision-Language Model for Visual Understanding

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Unified Generative and Discriminative Training for Multi-modal Large Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

MMBench: Is Your Multi-modal Model an All-around Player?

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Are We on the Right Way for Evaluating Large Vision-Language Models?

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Veagle: Advancements in Multimodal Representation Learning