Abstract: With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance the LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.
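The retrieve-then-rerank flow that the abstract describes can be illustrated with a minimal sketch. This is not LamRA's implementation: the embeddings are hand-written toy vectors, and `retrieve`, `pointwise_rerank`, and `toy_relevance` are hypothetical names introduced only for illustration. In the paper's setting, an LMM would produce both the embeddings and the reranking scores.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, candidates, k):
    """Stage 1: return the ids of the k candidates most similar to the query.

    `candidates` maps a candidate id to its embedding vector.
    """
    ranked = sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )
    return ranked[:k]

def pointwise_rerank(candidate_ids, score_fn):
    """Stage 2: score each retrieved candidate independently, then re-sort."""
    return sorted(candidate_ids, key=score_fn, reverse=True)

# Toy data (assumed values, not taken from the paper).
candidates = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
query_vec = [1.0, 0.2]

# Hypothetical per-candidate relevance scores standing in for a reranker.
toy_relevance = {"a": 0.3, "b": 0.9, "c": 0.1}

top_k = retrieve(query_vec, candidates, k=2)
reranked = pointwise_rerank(top_k, lambda cid: toy_relevance[cid])
print(top_k, reranked)
```

A listwise reranker would instead see all retrieved candidates at once and output an ordering over them; the pointwise variant shown here scores each candidate in isolation.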

What problem does this paper attempt to address?

This paper addresses how to use Large Multimodal Models (LMMs) for efficient, general-purpose retrieval and reranking in multimodal information retrieval. Existing methods usually rely on fine-tuning vision-language models for specific tasks, which is cumbersome and time-consuming when the retrieval tasks are complex or diverse. The paper proposes a framework named LamRA, which extends LMMs with lightweight LoRA modules so that they can handle multiple retrieval tasks and generalize to unseen retrieval tasks without additional training.

### Main Contribution Points:

1. **Introduction of the LamRA Framework**: The framework is designed to give LMMs advanced retrieval and reranking capabilities.
2. **Two-stage Training Strategy**: First, language-only pre-training improves the feature extraction capabilities of the LMM; then, multimodal instruction tuning adapts the model to a variety of retrieval tasks.
3. **Joint Training for Reranking**: Both pointwise and listwise reranking are supported, further improving retrieval performance.
4. **Extensive Experimental Verification**: Experiments on multiple datasets demonstrate the effectiveness and robustness of LamRA in supervised and zero-shot settings, especially its generalization to unseen retrieval tasks.

### Specific Problems Solved:

- **Unifying Retrieval Tasks**: With the LamRA framework, all retrieval tasks can be expressed under the same formulation, simplifying the multimodal information retrieval pipeline.
- **Generalization Ability**: LamRA performs well on known tasks and also generalizes to unseen tasks, indicating its potential in practical applications.
- **Reducing Training Costs**: Lightweight LoRA modules and a two-stage training strategy reduce the dependence on large amounts of labeled data and lower the training cost.

### Experimental Results:

- **Multimodal and Plain-text Retrieval Tasks**: LamRA significantly outperforms existing dual-encoder methods such as UniIR-CLIP on retrieval tasks with multiple input formats.
- **Generalization Ability on Unseen Datasets**: On 10 unseen datasets, LamRA either significantly exceeds other strong baseline methods or is comparable to them.
- **Generalization Ability on Unseen Retrieval Tasks**: LamRA also generalizes strongly to unseen retrieval tasks. For example, in the text-image-text retrieval task on the InfoSeek dataset, LamRA is 26.3 percentage points higher than UniIR-CLIP on the Recall@5 metric.

In conclusion, the LamRA framework addresses several challenges in multimodal information retrieval, with the most notable progress in generalization ability and reduced training cost.
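For readers unfamiliar with the Recall@5 metric cited above, the following is a minimal sketch of how Recall@K is commonly computed. It uses the generic definition (the fraction of relevant items found in the top K results, averaged over queries); the paper's exact evaluation protocol is not stated in this summary, so treat this as an assumption. The function names and the toy rankings are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant id")
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)

def mean_recall_at_k(runs, k):
    """Average Recall@K over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

# Toy rankings for two queries (assumed data, not from the paper).
runs = [
    (["d3", "d1", "d7", "d2", "d9", "d4"], ["d1"]),  # relevant item at rank 2
    (["d5", "d6", "d8", "d0", "d2", "d1"], ["d1"]),  # relevant item at rank 6
]

score = mean_recall_at_k(runs, k=5)
print(f"Recall@5 = {score:.2f}")
```

When every query has exactly one relevant item, this reduces to a hit rate: the share of queries whose correct item appears in the top K. A gap expressed in percentage points is the difference between two systems' mean scores multiplied by 100.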

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Retrieval-Augmented Personalization for Multimodal Large Language Models

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit

Reliable, Adaptable, and Attributable Language Models with Retrieval

Bridging the Preference Gap between Retrievers and LLMs

MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training

RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training

Fine-Tuning LLaMA for Multi-Stage Text Retrieval

Retrieve Anything To Augment Large Language Models

RRAML: Reinforced Retrieval Augmented Machine Learning

RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing

Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

Leveraging LLMs for Unsupervised Dense Retriever Ranking

LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding

UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models

BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine

R4: Reinforced Retriever-Reorder-Responder for Retrieval-Augmented Large Language Models

Accelerating Retrieval-Augmented Language Model Serving with Speculation

Multi-Modal Retrieval For Large Language Model Based Speech Recognition