Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

Xiujun Li, Yujie Lu, Zhe Gan, Jianfeng Gao, William Yang Wang, Yejin Choi
2024-06-11
Abstract: Recent multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM) and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight benchmarks, including OKVQA, MM-Vet, MathVista, and MMMU, and probe diverse MLLMs in both the text-modality instruction (TEM) setting and the VIM setting. Notably, we observe a significant performance disparity between the original TEM and VIM settings for open-source MLLMs, indicating that open-source MLLMs face greater challenges when the text instruction is presented solely in image form. To address this issue, we train v-MLLM, a generalizable model capable of robust instruction following under both text-modality and visual-modality instructions.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper evaluates how well Multimodal Large Language Models (MLLMs) follow instructions that arrive through the visual modality. Specifically, the authors introduce a new setting, Visual Modality Instruction (VIM), in which the text instruction is embedded in the image rather than provided to the model as plain text. The goal is to test whether existing MLLMs can understand and follow such embedded instructions without having been explicitly trained on this kind of data.

The main contributions of the paper are:

1. **Proposing VIM**: A challenging setting for testing the performance of MLLMs under visual-modality instructions.
2. **Adapting to multiple benchmarks**: Applying VIM to eight existing benchmarks, which reveals a significant performance gap for open-source MLLMs between the Text Modality Instruction (TEM) and Visual Modality Instruction (VIM) settings.
3. **Training v-MLLM**: Developing a new model, v-MLLM, that follows instructions robustly under both text-modality and visual-modality instructions.

Together, these results expose a weakness of existing open-source MLLMs in handling visual-modality instructions and offer a remedy: training the model so that it handles instructions in both modalities.
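
The difference between the two settings can be made concrete at the input level. The following is a minimal sketch, assuming the Pillow library is installed; it is not the paper's rendering pipeline. The function names `embed_instruction` and `build_inputs`, the choice to print the instruction in a white band below the image, the default font, the rough character-width estimate, and the empty text prompt in the VIM setting are all illustrative assumptions.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def embed_instruction(image, instruction, line_height=16, margin=8, char_width=7):
    """Return a new image with `instruction` printed in a white band below `image`.

    Hypothetical helper: the layout (band below the image, default font) is an
    assumption for illustration, not the layout used in the paper.
    """
    # Rough estimate of how many characters fit on one line (assumed glyph width).
    chars_per_line = max(1, (image.width - 2 * margin) // char_width)
    lines = textwrap.wrap(instruction, width=chars_per_line) or [""]
    band_height = 2 * margin + line_height * len(lines)

    # New canvas: original image on top, white band for the text underneath.
    canvas = Image.new("RGB", (image.width, image.height + band_height), "white")
    canvas.paste(image.convert("RGB"), (0, 0))

    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    for i, line in enumerate(lines):
        y = image.height + margin + i * line_height
        draw.text((margin, y), line, fill="black", font=font)
    return canvas


def build_inputs(image, instruction, setting):
    """Return an (image, text_prompt) pair for the given setting.

    TEM: the instruction is passed as text alongside the unmodified image.
    VIM: the instruction exists only as pixels; the text prompt is left empty
         (an assumption based on "presented solely in image form").
    """
    if setting == "TEM":
        return image, instruction
    if setting == "VIM":
        return embed_instruction(image, instruction), ""
    raise ValueError(f"unknown setting: {setting!r}")
```

For example, `build_inputs(img, "What color is the background?", "VIM")` returns a taller image containing the printed question together with an empty text prompt, whereas the `"TEM"` setting returns the original image and the question as text. Feeding both pairs to the same model and comparing the answers is the kind of comparison the TEM versus VIM evaluation describes.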