Abstract: Since the resurgence of deep learning, vision-language models (VLMs) built on large language models (LLMs) have become more popular than ever. However, while LLMs can exploit extensive background knowledge and task information through in-context learning, most VLMs still struggle to understand complex multi-modal prompts containing multiple images. The issue can be traced back to the architectural design of VLMs or to their pre-training data. Specifically, current VLMs primarily emphasize multi-modal data with a single image, rather than multi-modal prompts with multiple images interleaved with text. Even though some newly proposed VLMs can handle user prompts with multiple images, their pre-training data does not provide multi-modal prompts more sophisticated than the interleaved images and text crawled from the web. We propose MMICL to address the issue from both the model and the data perspectives. We introduce a well-designed architecture capable of seamlessly integrating visual and textual context in an interleaved manner, and the MIC dataset, which reduces the gap between the training data and the complex user prompts found in real-world applications. The dataset includes: 1) multi-modal context with interleaved images and text, 2) textual references for each image, and 3) multi-image data with spatial, logical, or temporal relationships. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially on complex reasoning benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. The experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe is the reason behind the advanced performance of MMICL.
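
As a rough illustration of the first two prompt properties listed in the abstract (images interleaved with text, and a textual reference for each image), the following minimal sketch assembles such a prompt. It is not the authors' implementation: the `ImageRef` class, the `build_interleaved_prompt` function, the `<imageN>` placeholder syntax, and the file names are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ImageRef:
    """Hypothetical stand-in for an image; only its path is stored."""
    path: str


Segment = Union[str, ImageRef]


def build_interleaved_prompt(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Flatten text and image segments into one prompt string.

    Each image is replaced by a textual reference ("image N") followed by
    a placeholder token ("<imageN>"). Both formats are assumptions made
    for this sketch, not the format used by MMICL.
    """
    parts: List[str] = []
    image_paths: List[str] = []
    for seg in segments:
        if isinstance(seg, ImageRef):
            idx = len(image_paths)          # images are numbered in order of appearance
            image_paths.append(seg.path)
            parts.append(f"image {idx}: <image{idx}>")
        else:
            parts.append(seg.strip())
    return " ".join(parts), image_paths


# Example: a two-image prompt whose question depends on a temporal relationship.
prompt, paths = build_interleaved_prompt([
    "Question: which photo was taken first?",
    ImageRef("a.jpg"),
    ImageRef("b.jpg"),
    "Answer:",
])
print(prompt)
print(paths)
```

In this sketch the returned list of paths preserves the order of appearance, so a downstream model wrapper could, under the same assumptions, match each `<imageN>` placeholder to the visual features of the N-th image.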

Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning

Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition

The Solution for Language-Enhanced Image New Category Discovery

Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification

TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt

LAMM: Label Alignment for Multi-Modal Prompt Learning

LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition

SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition

Text as Image: Learning Transferable Adapter for Multi-Label Classification

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

Global-local prompts guided image-text embedding, alignment and aggregation for multi-label zero-shot learning

Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models

Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification

DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Semantic Representation and Dependency Learning for Multi-Label Image Recognition

Large Language Models are Good Prompt Learners for Low-Shot Image Classification

Domain-Controlled Prompt Learning

Label prompt for multi-label text classification

Language Models as Black-Box Optimizers for Vision-Language Models