LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Yikun Liu,Pingan Chen,Jiayin Cai,Xiaolong Jiang,Yao Hu,Jiangchao Yao,Yanfeng Wang,Weidi Xie
2024-12-03
Abstract: With the rapid advancement of multimodal information retrieval, increasingly complex retrieval tasks have emerged. Existing methods predominately rely on task-specific fine-tuning of vision-language models, often those trained with image-text contrastive learning. In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. This approach enables unifying all retrieval tasks under the same formulation and, more importantly, allows for extrapolation towards unseen retrieval tasks without additional training. Our contributions can be summarised in the following aspects: (i) We introduce LamRA, a versatile framework designed to empower LMMs with sophisticated retrieval and reranking capabilities. (ii) For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning to progressively enhance LMM's retrieval performance. (iii) For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance. (iv) Extensive experimental results underscore the efficacy of our method in handling more than ten retrieval tasks, demonstrating robust performance in both supervised and zero-shot settings, including scenarios involving previously unseen retrieval tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks how Large Multimodal Models (LMMs) can be used for efficient, general-purpose retrieval and reranking in multimodal information retrieval. Existing methods usually fine-tune vision-language models for each specific task, which becomes cumbersome and time-consuming as retrieval tasks grow more complex and diverse. The paper proposes a framework named LamRA, which equips LMMs with lightweight LoRA modules so that a single model can handle multiple retrieval tasks and generalize to unseen retrieval tasks without additional training.

### Main Contribution Points:

1. **Introduction of the LamRA Framework**: The framework is designed to give LMMs advanced retrieval and reranking capabilities.
2. **Two-stage Training Strategy**: Language-only pre-training first improves the feature extraction capabilities of the LMM; multimodal instruction tuning then adapts the model to a variety of retrieval tasks.
3. **Joint Training for Reranking**: Pointwise and listwise reranking are trained jointly, offering two ways to further improve retrieval performance.
4. **Extensive Experimental Verification**: Experiments on multiple datasets demonstrate the effectiveness and robustness of LamRA in supervised and zero-shot settings, in particular its ability to generalize to unseen retrieval tasks.

### Specific Problems Solved:

- **Unifying Retrieval Tasks**: LamRA places all retrieval tasks under the same formulation, simplifying the multimodal retrieval pipeline.
- **Generalization Ability**: LamRA performs well on tasks seen during training and also generalizes to unseen tasks, which indicates its potential in practical applications.
- **Reducing Training Costs**: Lightweight LoRA modules and the two-stage training strategy reduce the dependence on large amounts of labeled data and lower the training cost.

### Experimental Results:

- **Multimodal and Plain-text Retrieval Tasks**: LamRA significantly outperforms existing dual-encoder methods such as UniIR-CLIP on retrieval tasks with multiple input formats.
- **Generalization Ability on Unseen Datasets**: On 10 unseen datasets, LamRA either significantly exceeds other strong baselines or is comparable to them.
- **Generalization Ability on Unseen Retrieval Tasks**: LamRA also generalizes well to unseen retrieval tasks. For example, on the text-image-text retrieval task on the InfoSeek dataset, LamRA is 26.3 percentage points higher than UniIR-CLIP in Recall@5.

In conclusion, the LamRA framework addresses several challenges in multimodal information retrieval, with its most notable gains in generalization ability and reduced training cost.
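
To make the retrieve-then-rerank pipeline and the Recall@5 metric mentioned above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the embeddings are toy 2-D vectors rather than LMM outputs, the reranker scores are hard-coded stand-ins for a pointwise reranking model, and all function names (`retrieve`, `pointwise_rerank`, `recall_at_k`) are hypothetical. Recall@K is taken here to mean the fraction of queries with at least one relevant candidate in the top K, which is a common convention but may differ in detail from the paper's evaluation protocol.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             candidate_vecs: Sequence[Sequence[float]],
             k: int) -> list[int]:
    """Return the indices of the k candidates most similar to the query."""
    order = sorted(range(len(candidate_vecs)),
                   key=lambda i: cosine(query_vec, candidate_vecs[i]),
                   reverse=True)
    return order[:k]


def pointwise_rerank(candidates: Sequence[int],
                     score_fn: Callable[[int], float]) -> list[int]:
    """Reorder candidates by a score computed independently per candidate."""
    return sorted(candidates, key=score_fn, reverse=True)


def recall_at_k(ranked_lists: Sequence[Sequence[int]],
                relevant_sets: Sequence[set[int]],
                k: int) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(1 for ranked, rel in zip(ranked_lists, relevant_sets)
               if any(i in rel for i in ranked[:k]))
    return hits / len(ranked_lists)


# Toy data (assumption): four candidates and one query in a 2-D embedding space.
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.2]
relevant = [{0}]  # candidate 0 is the ground-truth match for the query

first_stage = retrieve(query, candidates, k=3)

# Hypothetical reranker scores standing in for a pointwise reranking model.
reranker_scores = {0: 0.9, 1: 0.4, 3: 0.1}
reranked = pointwise_rerank(first_stage, lambda i: reranker_scores[i])

recall_before = recall_at_k([first_stage], relevant, k=1)
recall_after = recall_at_k([reranked], relevant, k=1)
```

In this toy setup the first-stage retriever ranks candidate 1 slightly ahead of the relevant candidate 0, and the reranking step moves candidate 0 to the top, so Recall@1 rises from 0 to 1. A listwise reranker would instead score the whole candidate list in one pass; that variant is not shown here.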