Abstract: State-of-the-art retrieval models typically address a straightforward search scenario, where retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but underperforms a smaller CLIP retriever in cross-modal retrieval tasks due to modality bias from MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose to continually fine-tune the universal multimodal retriever to enhance its text retrieval capability while maintaining multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on the MTEB retrieval benchmark. Finally, we explore prompting off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way to advance universal multimodal retrieval in the future.
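
To make the bi-encoder setup concrete, here is a minimal sketch of contrastive training with in-batch negatives. The abstract does not state the training objective, so the InfoNCE loss, the function name `info_nce_loss`, the temperature of 0.05, and the toy two-dimensional embeddings are all assumptions for illustration rather than the authors' implementation. In a real system the embeddings would come from the MLLM encoding queries and candidates separately.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch (assumed objective, not confirmed by the abstract).

    doc_embs[i] is the positive for query_embs[i]; every other document
    in the batch serves as an in-batch negative for that query.
    """
    queries = [normalize(q) for q in query_embs]
    docs = [normalize(d) for d in doc_embs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Cross-entropy with the positive document as the target class.
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Toy batch: each query is aligned with its own positive document.
aligned_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the positives swapped, so every query matches the wrong document.
misaligned_loss = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each query is most similar to its own positive and grows when a negative scores higher, which is the signal that pulls query and candidate embeddings into a shared space.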

What problem does this paper attempt to address?

The paper addresses the inability of existing information retrieval models to handle multimodal (text, image, etc.) queries and documents. Specifically:

1. **Limitations of Existing Models**:
   - Existing retrieval models typically support only a single modality for both queries and retrieved results, as in text-to-text retrieval.
   - They usually target a fixed retrieval task and struggle with complex queries composed of both text and images.
   - A fine-tuned MLLM retriever exhibits modality bias in cross-modal retrieval: given a text query, it tends to retrieve relevant text rather than images, and it underperforms a smaller CLIP retriever on these tasks.
2. **Objectives**:
   - To build a universal multimodal retrieval model that handles queries and documents in various modalities and supports diverse retrieval tasks.
   - To improve performance on complex multimodal queries, particularly in visual question answering and composed image retrieval.
   - To mitigate modality bias through modality-aware hard negative mining, and to strengthen text retrieval through continual text-to-text fine-tuning.
3. **Methods**:
   - Fine-tuning a multimodal large language model (MLLM) as a bi-encoder retriever with task-specific instructions, on 10 datasets covering 16 retrieval tasks, so that it can understand complex multimodal queries.
   - Proposing a modality-aware hard negative mining method to reduce modality bias.
   - Continually fine-tuning the universal multimodal retriever to enhance its text-to-text retrieval capability while maintaining its multimodal retrieval capability.
   - Prompting off-the-shelf MLLMs as zero-shot rerankers to refine the ranking of candidates returned by the retriever.
4. **Contributions**:
   - Investigating how to fine-tune MLLMs for universal multimodal retrieval while maintaining strong text-to-text retrieval capability.
   - Achieving state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, while also surpassing the text retrieval model NV-Embed-v1 on the MTEB retrieval benchmark.
   - Showing that zero-shot reranking with MLLMs can further improve multimodal retrieval when queries, such as text-image composed queries, are complex and challenging to understand.

Through these methods and contributions, the paper aims to advance information retrieval so that it can better handle diverse multimodal queries and documents in real-world scenarios.
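
The summary above names modality-aware hard negative mining but does not spell out the selection rule. The sketch below shows one plausible reading, under stated assumptions: from a first-stage ranking, it collects (a) candidates of the wrong modality that rank above the first positive and (b) non-positive candidates of the desired modality. The `Candidate` type, the function name, and the `max_per_type` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    modality: str  # e.g. "text" or "image"


def mine_modality_aware_negatives(ranked, positive_ids, target_modality, max_per_type=2):
    """Split hard negatives into wrong-modality and right-modality groups.

    ranked: candidates ordered best-first by a first-stage retriever.
    positive_ids: ids of the labeled positives for this query.
    target_modality: the modality the task asks the retriever to return.
    This selection rule is an illustrative assumption, not the paper's exact procedure.
    """
    # Rank of the first retrieved positive; if none was retrieved,
    # treat every candidate as ranked above the positive.
    positive_rank = next(
        (rank for rank, cand in enumerate(ranked) if cand.doc_id in positive_ids),
        len(ranked),
    )
    # Negatives that outrank the positive despite having the wrong modality.
    wrong_modality = [
        cand for cand in ranked[:positive_rank] if cand.modality != target_modality
    ]
    # Negatives with the desired modality that are not labeled as relevant.
    right_modality = [
        cand
        for cand in ranked
        if cand.modality == target_modality and cand.doc_id not in positive_ids
    ]
    return wrong_modality[:max_per_type], right_modality[:max_per_type]


# Toy ranking for a text query whose desired result is an image ("i2").
ranking = [
    Candidate("t1", "text"),
    Candidate("i1", "image"),
    Candidate("t2", "text"),
    Candidate("i2", "image"),
    Candidate("i3", "image"),
]
wrong, right = mine_modality_aware_negatives(ranking, {"i2"}, "image")
```

In this toy ranking, the text candidates that outrank the image positive become wrong-modality negatives, which is the kind of training signal that would discourage a retriever from favoring text when an image is requested.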

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
