Abstract: State-of-the-art retrieval models typically address a straightforward search scenario, where retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries composed of both text and image, but underperforms a smaller CLIP retriever in cross-modal retrieval tasks due to modality bias from MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose to continually fine-tune the universal multimodal retriever to enhance its text retrieval capability while maintaining multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. Finally, we explore prompting off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way for advancing universal multimodal retrieval in the future.

Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Multiple Kernel Visual-Auditory Representation Learning for Retrieval

CAMVR: Context-Adaptive Multi-View Representation Learning for Dense Retrieval

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

UnifieR: A Unified Retriever for Large-Scale Retrieval

Universal Multimodal Representation for Language Understanding

Effective Deep Learning-Based Multi-Modal Retrieval

MuMUR: Multilingual Multimodal Universal Retrieval

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

Multi-Modal Retrieval Via Deep Textual-Visual Correlation Learning

MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin

UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Unified Generative and Discriminative Training for Multi-modal Large Language Models

UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models

Effective Multi-Modal Retrieval Based on Stacked Auto-Encoders

Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images

Achieving Cross Modal Generalization with Multimodal Unified Representation

Unifying Vision-Language Representation Space with Single-tower Transformer