Abstract: State-of-the-art retrieval models typically address a straightforward search scenario, where retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever can understand challenging queries composed of both text and image, but underperforms a smaller CLIP retriever on cross-modal retrieval tasks because of the modality bias of MLLMs. To address this issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose continually fine-tuning the universal multimodal retriever to enhance its text retrieval capability while maintaining its multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. Finally, we explore prompting off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of the candidates returned by the multimodal retriever. We find that, through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way for future advances in universal multimodal retrieval.
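The abstract names modality-aware hard negative mining but does not spell out the procedure. The following is only a minimal sketch of one plausible reading, assuming that hard negatives are drawn from the retriever's own ranked list and split by whether their modality matches the modality the task asks for. The `Candidate` type, the function name, and the parameter `k` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Candidate:
    doc_id: str
    modality: str  # e.g. "text" or "image" (labels are an assumption)
    score: float   # retriever similarity to the query


def mine_modality_aware_negatives(
    ranked: List[Candidate],
    positive_ids: Set[str],
    target_modality: str,
    k: int = 1,
) -> Tuple[List[str], List[str]]:
    """Split top-scoring non-positive candidates into two groups.

    Assumed reading of "modality-aware": one group holds negatives whose
    modality differs from the requested one, the other holds negatives
    with the requested modality that are still not relevant.
    """
    wrong_modality: List[str] = []
    right_modality: List[str] = []
    # Visit candidates from highest to lowest retriever score.
    for cand in sorted(ranked, key=lambda c: c.score, reverse=True):
        if cand.doc_id in positive_ids:
            continue  # never treat a labeled positive as a negative
        if cand.modality != target_modality:
            wrong_modality.append(cand.doc_id)
        else:
            right_modality.append(cand.doc_id)
    # Keep at most k of the highest-scoring negatives from each group.
    return wrong_modality[:k], right_modality[:k]


ranked = [
    Candidate("t1", "text", 0.9),
    Candidate("i1", "image", 0.8),
    Candidate("i2", "image", 0.7),
    Candidate("t2", "text", 0.6),
]
wrong, right = mine_modality_aware_negatives(ranked, {"i1"}, "image", k=1)
```

In this toy case the query asks for an image, so the top-ranked text document `t1` lands in the wrong-modality group and `i2` in the right-modality group. Training against both groups is one way a retriever could be pushed away from favoring a single modality, though the actual training recipe is not given in the abstract.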

Leverage Points in Modality Shifts: Comparing Language-only and Multimodal Word Representations

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

Topological Perspectives on Optimal Multimodal Embedding Spaces

Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

Explaining Multi-modal Large Language Models by Analyzing their Vision Perception

ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map

Learning Multi-Modal Word Representation Grounded in Visual Context

Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Universal Multimodal Representation for Language Understanding

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

Using Multimodal Deep Neural Networks to Disentangle Language from Visual Aesthetics

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces

Probing Multimodal Large Language Models for Global and Local Semantic Representations

Learning semantic sentence representations from visually grounded language without lexical knowledge

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs