Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

Xiujun Li, Yujie Lu, Zhe Gan, Jianfeng Gao, William Yang Wang, Yejin Choi

2024-06-11

Abstract: Recent multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM) and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight benchmarks, including OKVQA, MM-Vet, MathVista, and MMMU, and probe diverse MLLMs in both the text-modality instruction (TEM) setting and the VIM setting. Notably, we observe a significant performance disparity between the original TEM and VIM settings for open-source MLLMs, indicating that open-source MLLMs face greater challenges when the text instruction is presented solely in image form. To address this issue, we train v-MLLM, a generalizable model capable of robust instruction following under both text-modality and visual-modality instructions.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper evaluates how well Multimodal Large Language Models (MLLMs) follow instructions that arrive through the visual channel. Specifically, the authors introduce a new setting, Visual Modality Instruction (VIM), in which the text instruction is embedded in the image rather than provided to the model as plain text. The goal is to test whether existing MLLMs can understand and follow instructions embedded in images, even though they were not explicitly trained on such data.

The main contributions of the paper are:

1. **Proposing VIM**: A challenging setting for testing how MLLMs perform when the instruction is given only in pixels.
2. **Adapting to multiple benchmarks**: Applying VIM to eight existing benchmarks, which reveals a significant performance gap for open-source MLLMs between the Text Modality Instruction (TEM) and VIM settings.
3. **Training v-MLLM**: Developing a new model, v-MLLM, which shows robust instruction-following capabilities under both text-modality and visual-modality instructions.

Together, these results point out a weakness of existing open-source MLLMs in processing visual-modality instructions and offer a remedy: training a model specifically to follow instructions in both modalities. This gives future work on instruction following from pixels a concrete reference point.
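The difference between the TEM and VIM settings can be sketched in a few lines of Python using Pillow. This is only an illustration under stated assumptions, not the authors' pipeline: the helper names `make_vim_image` and `build_inputs`, the placement of the instruction on a white strip below the image, the default font, and the empty text prompt in the VIM case are all hypothetical choices made here for clarity.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def make_vim_image(image, instruction, chars_per_line=40, line_height=14, margin=8):
    """Return a copy of `image` with `instruction` printed on a white strip below it.

    Layout, font, and sizes are illustrative assumptions, not the paper's settings.
    """
    lines = textwrap.wrap(instruction, width=chars_per_line) or [""]
    strip_height = 2 * margin + line_height * len(lines)

    # New canvas: original image on top, white strip for the instruction below.
    canvas = Image.new("RGB", (image.width, image.height + strip_height), "white")
    canvas.paste(image.convert("RGB"), (0, 0))

    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    for i, line in enumerate(lines):
        y = image.height + margin + i * line_height
        draw.text((margin, y), line, fill="black", font=font)
    return canvas


def build_inputs(image, instruction, setting):
    """Assemble hypothetical model inputs for the TEM or VIM setting."""
    if setting == "TEM":
        # Text-modality instruction: image and instruction are separate inputs.
        return {"image": image, "text": instruction}
    if setting == "VIM":
        # Visual-modality instruction: the instruction exists only as pixels.
        return {"image": make_vim_image(image, instruction), "text": ""}
    raise ValueError(f"unknown setting: {setting!r}")


if __name__ == "__main__":
    photo = Image.new("RGB", (320, 200), "gray")  # stand-in for a benchmark image
    question = "What is shown in the picture?"
    tem = build_inputs(photo, question, "TEM")
    vim = build_inputs(photo, question, "VIM")
    print(tem["image"].size, repr(tem["text"]))
    print(vim["image"].size, repr(vim["text"]))
```

In the TEM case the model can read the question directly as tokens; in the VIM case it has to recognize the printed text and then act on it, which is the capability the paper probes.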

Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy

VIGC: Visual Instruction Generation and Correction

Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions

MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Towards Multimodal In-Context Learning for Vision & Language Models

Demonstrative Instruction Following in Multimodal LLMs Via Integrating Low-Rank Adaptation with Ensemble Learning

Instruction Tuning-free Visual Token Complement for Multimodal LLMs

VILA: On Pre-training for Visual Language Models

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Instruction-Guided Visual Masking

What Large Language Models Bring to Text-rich VQA?

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models