Abstract: Motivated by the in-context learning (ICL) capabilities of large language models (LLMs), multimodal LLMs with an additional visual modality have also been shown to exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has investigated the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations over different modality information, we show that modalities matter differently across tasks in multimodal ICL. Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance. We also find that models may follow inductive biases from multimodal ICL even if they are rarely seen in or contradict semantic priors from pretraining data. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks.

What problem does this paper attempt to address?

The paper addresses a gap in the study of multimodal in-context learning (ICL): how strongly different tasks depend on each modality, and the principles behind that dependence, have not been fully studied. Specifically:

1. **The role of different modalities in multimodal ICL**:
   - The paper systematically evaluates multimodal large language models (multimodal LLMs) of different scales on a range of new tasks, examining the importance of visual and textual information in multimodal ICL.
   - The study finds that tasks depend on visual and textual information to varying degrees. In some tasks, perturbing the visual information has little impact on performance, while in others, such as Key Information Extraction (KIE), perturbing the visual information causes significant performance degradation.
2. **Effective demonstration selection strategies**:
   - The paper proposes modality-driven demonstration selection strategies to improve multimodal ICL performance:
     - **Vision-driven**: Using visual similarity (e.g., the visual encoder of the CLIP model) to select demonstrations that are visually similar to the test sample.
     - **Text-driven**: Using textual similarity (e.g., the text encoder of the CLIP model or BERTScore) to select demonstrations that are textually similar to the test sample.
     - **Bimodal-driven**: Combining visual and textual similarity (e.g., the ALBEF model) to select demonstrations.
3. **The model's ability to capture task inductive biases from multimodal ICL**:
   - The paper also explores whether models can capture a task's inductive biases from multimodal ICL, even if those tasks are rarely seen in the pre-training data or contradict its semantic priors.
   - Experimental results show that large-scale models are better at capturing these inductive biases and exhibit better performance in zero-shot inference.

In summary, the paper aims to understand the role of demonstrations in multimodal ICL through systematic analysis and experiments, and proposes effective strategies to improve multimodal ICL performance on various tasks.
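
The modality-driven selection strategies listed above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes embeddings have already been computed by some encoder (for example, CLIP's image and text encoders), and the dictionary keys `image_emb` and `text_emb`, the function names, and the toy vectors are all hypothetical. The bimodal branch simply averages the two cosine similarities, which is a stand-in for a joint model such as ALBEF rather than a description of it.

```python
import math
from typing import Dict, List, Sequence

Embedding = Sequence[float]
Example = Dict[str, Embedding]  # hypothetical keys: "image_emb", "text_emb"


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_demonstrations(
    query: Example, pool: List[Example], k: int = 2, mode: str = "vision"
) -> List[int]:
    """Return indices of the k pool examples most similar to the query.

    mode selects which modality drives the ranking:
      "vision"  -> image-embedding similarity only
      "text"    -> text-embedding similarity only
      "bimodal" -> mean of both similarities (an assumption of this sketch)
    """
    def score(item: Example) -> float:
        img = cosine(query["image_emb"], item["image_emb"])
        txt = cosine(query["text_emb"], item["text_emb"])
        if mode == "vision":
            return img
        if mode == "text":
            return txt
        if mode == "bimodal":
            return 0.5 * (img + txt)
        raise ValueError(f"unknown mode: {mode}")

    # Rank pool indices from most to least similar and keep the top k.
    ranked = sorted(range(len(pool)), key=lambda i: score(pool[i]), reverse=True)
    return ranked[:k]


# Toy 2-d embeddings standing in for real encoder outputs.
query = {"image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]}
pool = [
    {"image_emb": [1.0, 0.1], "text_emb": [1.0, 0.0]},  # visually close, textually far
    {"image_emb": [0.0, 1.0], "text_emb": [0.1, 1.0]},  # visually far, textually close
    {"image_emb": [1.0, 1.0], "text_emb": [1.0, 1.0]},  # moderately close in both
]

vision_pick = select_demonstrations(query, pool, k=2, mode="vision")
text_pick = select_demonstrations(query, pool, k=2, mode="text")
bimodal_pick = select_demonstrations(query, pool, k=1, mode="bimodal")
```

On this toy pool the three modes choose different demonstrations, which is the point of making the selection modality-driven: for a task where visual information matters most (the summary cites KIE as one where visual perturbations hurt), the vision-driven ranking would be the natural choice.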

From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning
