What Makes Multimodal In-Context Learning Work?

Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski
2024-04-25
Abstract: Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with advanced-ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to explore the working mechanism of Multimodal In-Context Learning (M-ICL) and reveal its performance in Large Multimodal Models (LMMs). Specifically, the researchers use a comprehensive framework to evaluate the behavior of M-ICL across various multimodal tasks, including Visual Question Answering (VQA), image captioning, and classification. The main objectives of the paper are:

1. **Understanding the impact of each modality on M-ICL**: Investigate how text and image modalities affect the performance of M-ICL.
2. **Identifying shortcuts and limitations of M-ICL**: Explore whether M-ICL relies on certain shortcuts (e.g., majority voting) and how these shortcuts impact performance.
3. **Evaluating the effectiveness of advanced M-ICL strategies**: Study whether context selection methods based on similarity (e.g., RICES) are more effective than simple strategies (e.g., majority voting).

### Main Findings

1. **M-ICL primarily relies on text-driven mechanisms**: In most cases, M-ICL depends more on textual information, with the influence of the image modality being relatively minor. Particularly in VQA tasks, the importance of textual information surpasses that of image information.
2. **Limited effectiveness of advanced M-ICL strategies**: Even with advanced context selection strategies (e.g., RICES), the performance of M-ICL is not significantly better than that of simple majority voting strategies.
3. **Significant recency bias**: M-ICL models tend to replicate the answers of the most recent examples in the context, revealing some limitations of the model.

### Research Methods

1. **Datasets and Models**: The researchers used multiple datasets (e.g., COCO, CIFAR-100, ImageNet, VQAv2, etc.) and state-of-the-art open-source multimodal models (e.g., IDEFICS and OpenFlamingo) for experiments.
2. **Experimental Design**: By modifying the image or text information in the context, the researchers systematically analyzed the impact of different modalities on M-ICL performance. Additionally, they used similarity-based context selection methods (RICES) to further evaluate M-ICL behavior.
3. **Statistical Analysis**: Generalized Linear Models (GLM) and Spearman’s rank correlation were used to quantify the relationships between different factors.

### Conclusion

1. **Textual information dominates in M-ICL**: In most multimodal tasks, M-ICL relies more on textual information than on image information.
2. **Limited effectiveness of advanced M-ICL strategies**: Although similarity-based context selection methods (e.g., RICES) show some improvements in certain tasks, overall, they are not more effective than simple majority voting strategies.
3. **Recency bias is a major issue**: M-ICL models tend to replicate the answers of the most recent examples in the context, revealing potential issues in practical applications.

This paper provides important insights into understanding and optimizing Multimodal In-Context Learning, highlighting the limitations of current methods and directions for future research.
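
To make the comparison between similarity-based context selection and majority voting more concrete, here is a minimal sketch in plain Python. It is not the paper's code: the helper names (`cosine`, `select_context`, `majority_vote`), the toy two-dimensional embeddings, and the use of cosine similarity are all assumptions made for illustration. In practice the embeddings would come from an image or text encoder, and the selected examples would be placed in the model's prompt rather than voted on directly.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_context(query_vec, pool, k):
    """Similarity-based selection (RICES-style, hypothetical helper):
    return the k pool items whose embeddings are closest to the query,
    ordered from most to least similar."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


def majority_vote(context):
    """Baseline that ignores the query entirely and predicts the most
    frequent answer among the selected context examples."""
    counts = Counter(item["answer"] for item in context)
    return counts.most_common(1)[0][0]


# Toy support set: embeddings and answers are made up for illustration.
pool = [
    {"vec": [1.0, 0.0], "answer": "cat"},
    {"vec": [0.9, 0.1], "answer": "cat"},
    {"vec": [0.8, 0.3], "answer": "dog"},
    {"vec": [0.0, 1.0], "answer": "bird"},
]

context = select_context([1.0, 0.05], pool, k=3)
prediction = majority_vote(context)
print([item["answer"] for item in context], prediction)
```

If a model given the retrieved context does no better than `majority_vote` over that same context, the gain is attributable to the retrieved answers themselves rather than to the model's use of the query image, which is the kind of shortcut the summary describes.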