Abstract: Since the resurgence of deep learning, vision-language models (VLMs) that build on large language models (LLMs) have become more popular than ever. However, while LLMs can exploit extensive background knowledge and task information through in-context learning, most VLMs still struggle to understand complex multi-modal prompts that contain multiple images. The issue can be traced back to the architectural design of VLMs or to their pre-training data. Specifically, current VLMs primarily emphasize multi-modal data with a single image, rather than multi-modal prompts with multiple interleaved images and text. Even though some newly proposed VLMs can handle user prompts with multiple images, their pre-training data does not provide multi-modal prompts more sophisticated than the interleaved images and text crawled from the web. We propose MMICL to address the issue from both the model and the data perspectives. We introduce a well-designed architecture capable of seamlessly integrating visual and textual context in an interleaved manner, and the MIC dataset to reduce the gap between the training data and the complex user prompts in real-world applications, including: 1) multi-modal context with interleaved images and text, 2) textual references for each image, and 3) multi-image data with spatial, logical, or temporal relationships. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially on complex reasoning benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. The experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe is the reason behind the advanced performance of MMICL.

In-Context Compositional Generalization for Large Vision-Language Models

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

Visual In-Context Learning for Large Vision-Language Models

In-Context Learning Improves Compositional Understanding of Vision-Language Models

How to Configure Good In-Context Sequence for Visual Question Answering

Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

Efficient Large Multi-modal Models via Visual Context Compression

Towards Multimodal In-Context Learning for Vision & Language Models

Unraveling the Mechanics of Learning-Based Demonstration Selection for In-Context Learning

Does In-Context Learning Really Learn? Rethinking How Large Language Models Respond and Solve Tasks via In-Context Learning

Large Language Models Know What Makes Exemplary Contexts

Unifying Demonstration Selection and Compression for In-Context Learning

SADL: An Effective In-Context Learning Method for Compositional Visual QA

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

LIVE: Learnable In-Context Vector for Visual Question Answering

In-Context Learning Demonstration Selection via Influence Analysis

Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism

How Do In-Context Examples Affect Compositional Generalization?

Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Revisiting Demonstration Selection Strategies in In-Context Learning

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning