Abstract: Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with advanced-ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to explore the working mechanism of Multimodal In-Context Learning (M-ICL) and reveal its performance in Large Multimodal Models (LMMs). Specifically, the researchers evaluate the behavior of M-ICL across various multimodal tasks, including Visual Question Answering (VQA), image captioning, and classification tasks through a comprehensive framework. The main objectives of the paper are:

1. **Understanding the impact of each modality on M-ICL**: Investigate how text and image modalities affect the performance of M-ICL.
2. **Identifying shortcuts and limitations of M-ICL**: Explore whether M-ICL relies on certain shortcuts (e.g., majority voting) and how these shortcuts impact performance.
3. **Evaluating the effectiveness of advanced M-ICL strategies**: Study whether context selection methods based on similarity (e.g., RICES) are more effective than simple strategies (e.g., majority voting).

### Main Findings

1. **M-ICL primarily relies on text-driven mechanisms**: In most cases, M-ICL depends more on textual information, with the influence of the image modality being relatively minor. Particularly in VQA tasks, the importance of textual information surpasses that of image information.
2. **Limited effectiveness of advanced M-ICL strategies**: Even with advanced context selection strategies (e.g., RICES), the performance of M-ICL is not significantly better than that of simple majority voting strategies.
3. **Significant recency bias**: M-ICL models tend to replicate the most recent example answers in the context, revealing some limitations of the model.

### Research Methods

1. **Datasets and Models**: The researchers used multiple datasets (e.g., COCO, CIFAR-100, ImageNet, VQAv2, etc.) and state-of-the-art open-source multimodal models (e.g., IDEFICS and OpenFlamingo) for experiments.
2. **Experimental Design**: By modifying the image or text information in the context, the researchers systematically analyzed the impact of different modalities on M-ICL performance. Additionally, they used similarity-based context selection methods (RICES) to further evaluate M-ICL behavior.
3. **Statistical Analysis**: Generalized Linear Models (GLM) and Spearman’s rank correlation were used to quantify the relationships between different factors.

### Conclusion

1. **Textual information dominates in M-ICL**: In most multimodal tasks, M-ICL relies more on textual information rather than image information.
2. **Limited effectiveness of advanced M-ICL strategies**: Although similarity-based context selection methods (e.g., RICES) show some improvements in certain tasks, overall, they are not more effective than simple majority voting strategies.
3. **Recency bias is a major issue**: M-ICL models tend to replicate the most recent example answers in the context, revealing potential issues in practical applications.

This paper provides important insights into understanding and optimizing Multimodal In-Context Learning, highlighting the limitations of current methods and directions for future research.
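To make the majority-voting baseline mentioned in the findings concrete, here is a minimal Python sketch of similarity-based context selection followed by a majority vote over the retrieved answers. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cosine`, `retrieve_top_k`, `majority_vote`), the toy two-dimensional embeddings, and the use of cosine similarity are all hypothetical choices made for this example, and the embeddings are assumed to be precomputed.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query, support, k):
    """Return the answers of the k support examples most similar to the query.

    `support` is a list of (embedding, answer) pairs; this data layout is an
    assumption of the sketch. Answers come back most-similar first.
    """
    ranked = sorted(support, key=lambda item: cosine(query, item[0]), reverse=True)
    return [answer for _, answer in ranked[:k]]


def majority_vote(answers):
    """Predict the most frequent answer among the context examples.

    Ties resolve to the answer that appears first in `answers`.
    """
    if not answers:
        raise ValueError("majority_vote needs at least one answer")
    return Counter(answers).most_common(1)[0][0]


# Toy support set with made-up embeddings and labels (purely illustrative).
support = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.8, 0.3], "dog"),
]
query = [1.0, 0.05]

context_answers = retrieve_top_k(query, support, k=3)
prediction = majority_vote(context_answers)
print(context_answers, prediction)
```

Note that this baseline never calls a model: it predicts from the retrieved answers alone. That is what makes it a useful reference point when judging how much a similarity-based selection strategy adds on top of simply copying the dominant answer in the context.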

What Makes Multimodal In-Context Learning Work?

Towards Multimodal In-Context Learning for Vision & Language Models

Multimodal Contrastive In-Context Learning

From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning

Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

Can MLLMs Perform Text-to-Image In-Context Learning?

Multimodal Pretraining from Monolingual to Multilingual

Improving Context Understanding in Multimodal Large Language Models Via Multimodal Composition Learning

AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

In-Context Learning for Text Classification with Many Labels

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Many-Shot In-Context Learning in Multimodal Foundation Models

Towards More Unified In-context Visual Understanding

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Empowering MultiModal Models' In-Context Learning Ability through Large Language Models

MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

CaMML: Context-Aware Multimodal Learner for Large Models