UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
DOI: https://doi.org/10.48550/arXiv.2311.17136
2023-11-29
Abstract: Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To approach such different information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are keys to UniIR's generalization ability. Additionally, we construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Information Retrieval
What problem does this paper attempt to address?
This paper addresses the limitations of existing information retrieval (IR) systems on multimodal retrieval tasks. Existing IR models usually assume a homogeneous format, which limits their ability to meet diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding similar photos with a query image. Such systems mostly target homogeneous retrieval scenarios in specific domains and cannot adapt to information needs that span domains and modalities.

To address these needs, the paper introduces **UniIR**, a unified instruction-guided multimodal retriever that handles eight distinct retrieval tasks across modalities. UniIR is a single retrieval system jointly trained on ten diverse multimodal IR datasets. It interprets user instructions to execute different retrieval tasks, and it shows robust performance on existing datasets as well as zero-shot generalization to new tasks.

### Main contributions:

1. **UniIR framework**: A general multimodal information retrieval framework that aims to integrate various multimodal retrieval tasks into a coherent system.
2. **M-BEIR benchmark**: A large-scale multimodal retrieval benchmark that brings together ten different datasets from multiple domains, covering eight different multimodal retrieval tasks.
3. **UniIR model**: A general retriever trained on M-BEIR, laying the foundation for future research.

In addition, the paper evaluates the zero-shot performance of state-of-the-art vision-language pre-trained models on the M-BEIR benchmark.

### Key technologies in the solution:

- **Multi-task training**: Training jointly on multiple tasks lets UniIR perform well across them and, in particular, improves zero-shot generalization.
- **Instruction tuning**: Instruction tuning helps the model understand the user's retrieval intent, which improves retrieval accuracy.
- **Multimodal fusion mechanism**: The paper explores two multimodal fusion mechanisms, score-level fusion and feature-level fusion, to improve the model's performance.

### Experimental results:

- **Performance of zero-shot models**: Zero-shot models perform poorly on heterogeneous candidate pools, especially without instruction guidance, and often fail to understand the intent of the retrieval task.
- **Effects of multi-task training and instruction tuning**: Multi-task training and instruction tuning significantly improve the model's performance, especially its zero-shot generalization.

In conclusion, by proposing the UniIR framework and the M-BEIR benchmark, this paper aims to develop a more flexible and general neural retriever that meets users' diverse information needs.
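
The score-level fusion named under the key technologies can be illustrated with a minimal sketch. The summary does not specify how UniIR combines scores, so everything here is an assumption made for illustration: each modality is taken to be encoded separately, the final relevance score is an unweighted sum of cosine similarities over every query-modality and candidate-modality pair, and the names `score_level_similarity` and `rank` as well as the toy two-dimensional embeddings are hypothetical. Feature-level fusion, which would merge features inside the model before scoring, is not shown.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def score_level_similarity(query, candidate):
    """Sum per-modality similarity scores (assumed, unweighted fusion rule).

    `query` and `candidate` map a modality name ("text", "image") to an
    embedding. A missing modality is simply left out of the dict.
    """
    return sum(cosine(q, c) for q in query.values() for c in candidate.values())


def rank(query, candidates):
    """Return candidate names ordered from most to least similar."""
    return sorted(
        candidates,
        key=lambda name: score_level_similarity(query, candidates[name]),
        reverse=True,
    )


# Toy stand-in embeddings; a real system would obtain these from encoders.
query = {"text": [1.0, 0.0]}
candidates = {
    "a": {"image": [1.0, 0.0]},
    "b": {"image": [0.0, 1.0]},
    "c": {"image": [1.0, 1.0], "text": [1.0, 0.0]},
}
ranking = rank(query, candidates)
print(ranking)
```

Because the candidates mix image-only and image-plus-text entries, the toy pool also mirrors the heterogeneous candidate pools mentioned in the experimental results. Note that an unweighted sum favors candidates with more modalities, which is one reason a real system would likely learn or tune the fusion weights.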