Can MLLMs Perform Text-to-Image In-Context Learning?

Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee
2024-07-20
Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties, leading to notable improvements in performance. Our code and dataset are available at https://github.com/UW-Madison-Lee-Lab/CoBSAT.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper asks **whether multimodal large language models (MLLMs) can perform Text-to-Image In-Context Learning (T2I-ICL)**. T2I-ICL has received little attention in prior work, so the paper formally defines the task, constructs a benchmark dataset for it, and uses that dataset to systematically evaluate MLLMs.

### Problem Background

1. **Development of Multimodal Large Language Models (MLLMs)**:
   - MLLMs extend traditional large language models (LLMs) so that they can process multiple modalities such as text, image, video, and audio.
2. **Application of In-Context Learning (ICL)**:
   - ICL lets a model make predictions from examples supplied in its context, without updating model parameters. It was first applied to text-only tasks (Textual ICL) and later extended to visual tasks (Visual ICL) and multimodal tasks (Multimodal ICL, M-ICL).
3. **Current Research Status**:
   - Existing M-ICL research focuses mainly on Image-to-Text ICL (I2T-ICL), while T2I-ICL has been studied far less. T2I-ICL maps text input to image output, which gives it its own complexity and potential applications.

### Main Contributions of the Paper

1. **Identifying an Important Problem**:
   - The paper identifies and formally defines T2I-ICL, an important but underexplored ICL setting.
2. **Introducing the CoBSAT Benchmark Dataset**:
   - CoBSAT is a benchmark dataset covering five topics (color, background, style, action, texture). Each topic has two types of tasks, object inference and attribute inference, for ten tasks in total. The dataset is used to systematically evaluate the T2I-ICL ability of MLLMs.
3. **Evaluating the T2I-ICL Ability of MLLMs**:
   - The paper evaluates ten state-of-the-art MLLMs on T2I-ICL using the CoBSAT dataset. Some models, such as SEED-LLaMA, Gemini, and Qwen-VL, show some capability, but overall accuracy leaves substantial room for improvement.
4. **Understanding the Challenges of T2I-ICL**:
   - The study attributes the low performance of MLLMs on T2I-ICL to two main factors: (i) the inherent complexity of processing multimodal data; (ii) the difficulty of the image generation task itself.
5. **Enhancing the T2I-ICL Ability of MLLMs**:
   - The paper explores several techniques for improving T2I-ICL, including fine-tuning and Chain-of-Thought prompting, and finds that they notably improve performance.

### Conclusion

By defining the T2I-ICL task, constructing the CoBSAT dataset, and evaluating multiple MLLMs, the paper reveals both the potential and the challenges of MLLMs on T2I-ICL. Future research can further optimize these models to better handle complex multimodal tasks.
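
### Illustrative Sketch

To make the task structure described above concrete, here is a minimal Python sketch. It is not taken from the CoBSAT codebase. The topic and task-type names come from the summary; the `Demo` class, the `build_prompt` helper, the file names, and the example values are hypothetical. The example assumes an object-inference task in which each demonstration pairs an attribute word with an image of the same object, so the model has to infer the shared object from the images. This reading should be checked against the paper.

```python
from dataclasses import dataclass
from itertools import product

# Topic and task-type names as listed in the summary above.
TOPICS = ["color", "background", "style", "action", "texture"]
TASK_TYPES = ["object-inference", "attribute-inference"]


@dataclass(frozen=True)
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    text: str       # textual input, e.g. an attribute word
    image_ref: str  # placeholder for the paired image (made-up file name)


def list_tasks():
    """Enumerate the topic x task-type grid: 5 topics * 2 types = 10 tasks."""
    return [f"{topic}/{task_type}" for topic, task_type in product(TOPICS, TASK_TYPES)]


def build_prompt(demos, query_text):
    """Interleave (text, image) demonstrations, then end with a text-only query.

    The model is expected to continue the sequence by generating an image.
    """
    prompt = []
    for demo in demos:
        prompt.append(("text", demo.text))
        prompt.append(("image", demo.image_ref))
    prompt.append(("text", query_text))
    return prompt


tasks = list_tasks()

# Assumed example: every demonstration image shows a car, the text gives the color.
demos = [Demo("red", "red_car.jpg"), Demo("blue", "blue_car.jpg")]
prompt = build_prompt(demos, "green")

print(len(tasks))
for modality, content in prompt:
    print(modality, content)
```

Under this assumed setup, a correct model output would be an image of a green car: the color comes from the query text, while the object is never named and must be inferred from the demonstration images. No parameters are updated at any point, which is what makes this in-context learning rather than fine-tuning.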