Abstract: Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown rapidly in popularity. However, while LLMs can utilize extensive background knowledge and task information through in-context learning, most VLMs still struggle to understand complex multi-modal prompts with multiple images. The issue can be traced back to the architectural design of VLMs or to their pre-training data. Specifically, current VLMs primarily emphasize multi-modal data with a single image, rather than multi-modal prompts with multiple interleaved images and text. Even though some newly proposed VLMs can handle user prompts with multiple images, their pre-training data does not provide multi-modal prompts more sophisticated than the interleaved images and text crawled from the web. We propose MMICL to address the issue from both the model and data perspectives. We introduce a well-designed architecture capable of seamlessly integrating visual and textual context in an interleaved manner, and the MIC dataset to reduce the gap between the training data and the complex user prompts in real-world applications, including: 1) multi-modal context with interleaved images and text, 2) textual references for each image, and 3) multi-image data with spatial, logical, or temporal relationships. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially on complex reasoning benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. The experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe is the reason behind the advanced performance of MMICL.

Towards Multimodal-augmented Pre-trained Language Models Via Self-balanced Expectation-Maximization Iteration

Multimodal Pretraining from Monolingual to Multilingual

Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

MAMO: Fine-Grained Vision-Language Representations Learning with Masked Multimodal Modeling

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Dynamic Weighted Multitask Learning and Contrastive Learning for Multimodal Sentiment Analysis

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Multimodal Molecular Pretraining via Modality Blending

eP-ALM: Efficient Perceptual Augmentation of Language Models

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning