Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties, leading to notable improvements in performance. Our code and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT.

What problem does this paper attempt to address?

The problem this paper attempts to solve is **whether multimodal large language models (MLLMs) can perform Text-to-Image In-Context Learning (T2I-ICL) tasks**. Specifically, the paper aims to fill the gap left by the limited exploration of T2I-ICL in existing research, and to systematically evaluate MLLMs on this task by defining T2I-ICL and constructing a corresponding benchmark dataset.

### Problem Background

1. **Development of Multimodal Large Language Models (MLLMs)**:
   - MLLMs extend traditional large language models (LLMs) so that they can process data in multiple modalities such as text, image, video, and audio.
2. **Application of In-Context Learning (ICL)**:
   - ICL lets a model make predictions from demonstrations supplied in the context, without updating its parameters. It was initially applied to pure-text tasks (Textual ICL) and later extended to visual tasks (Visual ICL) and multimodal tasks (Multimodal ICL, M-ICL).
3. **Current Research Status**:
   - Existing M-ICL research mainly focuses on Image-to-Text ICL (I2T-ICL) tasks, while T2I-ICL has been less studied. T2I-ICL maps text input to image output, and it has its own complexity and potential application value.

### Main Contributions of the Paper

1. **Identifying Important Problems**:
   - The paper identifies and defines T2I-ICL, an important but not yet fully explored ICL setting.
2. **Introducing the CoBSAT Benchmark Dataset**:
   - CoBSAT is a comprehensive benchmark dataset covering five topics (color, background, style, action, texture). Each topic has two types of tasks, object inference and attribute inference, for ten tasks in total. The dataset is used to systematically evaluate the ability of MLLMs on T2I-ICL tasks.
3. **Evaluating the T2I-ICL Ability of MLLMs**:
   - The performance of ten state-of-the-art MLLMs on T2I-ICL tasks was evaluated using the CoBSAT dataset. The results show that although some models, such as SEED-LLaMA, Gemini, and Qwen-VL, show some capability, overall accuracy still needs to improve.
4. **Understanding the Challenges of T2I-ICL**:
   - The study attributes the low performance of MLLMs on T2I-ICL tasks to two main causes: (i) the inherent complexity of processing multimodal data; (ii) the difficulty of the image generation task itself.
5. **Enhancing the T2I-ICL Ability of MLLMs**:
   - Multiple techniques were explored to enhance the T2I-ICL ability of MLLMs, including fine-tuning and Chain-of-Thought prompting. The research shows that these methods can significantly improve T2I-ICL performance.

### Conclusion

The paper reveals the potential and challenges of MLLMs in T2I-ICL tasks by defining the T2I-ICL task, constructing the CoBSAT dataset, and evaluating the performance of multiple MLLMs. Future research can further optimize these models to better handle complex multimodal tasks.
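To make the task layout more concrete, here is a minimal Python sketch of how the ten CoBSAT tasks (five topics times two task types) could be enumerated, and how a T2I-ICL prompt might interleave text and image demonstrations before a final text query. The names `Demo`, `list_tasks`, and `build_prompt`, the task-naming scheme, and the image file names are all hypothetical and are not taken from the CoBSAT code base. The example also assumes one reading of "object inference": the text states the attribute, and the object appears only in the demonstration images and must be inferred.

```python
from dataclasses import dataclass
from itertools import product

# Topics and task types as listed in the summary above.
THEMES = ["color", "background", "style", "action", "texture"]
TASK_TYPES = ["object_inference", "attribute_inference"]


def list_tasks():
    """Enumerate every (topic, task type) combination as a task name."""
    return [f"{theme}_{task_type}" for theme, task_type in product(THEMES, TASK_TYPES)]


@dataclass
class Demo:
    """One in-context demonstration: a text input and its paired image."""
    text: str
    image: str  # hypothetical path to an image file


def build_prompt(demos, query_text):
    """Interleave (text, image) demonstrations, then append the text query.

    The model is expected to answer the final text with a generated image.
    """
    prompt = []
    for demo in demos:
        prompt.append(("text", demo.text))
        prompt.append(("image", demo.image))
    prompt.append(("text", query_text))
    return prompt


# Assumed object-inference example on the color topic: the text gives the
# color, while the object (a car) is visible only in the images.
demos = [Demo("red", "red_car.jpg"), Demo("blue", "blue_car.jpg")]
prompt = build_prompt(demos, "green")
tasks = list_tasks()
```

Under these assumptions, a model that has picked up the pattern would respond to the final `"green"` query with an image of a green car. The prompt is kept as a plain list of `(modality, content)` pairs so that it can be adapted to whatever input format a particular MLLM expects.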

Can MLLMs Perform Text-to-Image In-Context Learning?

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

What Makes Multimodal In-Context Learning Work?

CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

Multimodal Contrastive In-Context Learning

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Improving Context Understanding in Multimodal Large Language Models Via Multimodal Composition Learning

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

Towards Multimodal In-Context Learning for Vision & Language Models

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

Link-Context Learning for Multimodal LLMs

MileBench: Benchmarking MLLMs in Long Context

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning