Abstract: Training a Multimodal Large Language Model (MLLM) from scratch, like GPT-4, is resource-intensive. Regarding Large Language Models (LLMs) as the core processor for multimodal information, our paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network, designed to enable dynamic interaction between LLMs and external vision information. Previous methods incorporate visual information into LLMs with a simple visual mapping network or the Q-former from BLIP-2. Such networks project the image feature once and do not consider the interaction between the image and the human input query. Hence, the obtained visual information, which we refer to as static visual information, is not connected to human intention and may be inadequate for LLMs to generate intention-following responses. LMEye addresses this issue by allowing the LLM to request the desired visual information aligned with various human instructions, which we term dynamic visual information interaction. Specifically, LMEye consists of a simple visual mapping network that provides the basic perception of an image for LLMs. It also contains additional modules responsible for acquiring requests from LLMs, performing request-based visual information interaction, and transmitting the resulting interacted visual information to LLMs, respectively. In this way, LLMs understand the human query, deliver the corresponding request to the request-based visual information interaction module, and generate the response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on several multimodal benchmarks, demonstrating that it significantly improves zero-shot performance on various multimodal tasks compared to previous methods, with fewer parameters.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the issues that large language models (LLMs) face when processing multimodal information, particularly how to enable LLMs to dynamically interact with external visual information. Specifically, existing methods typically project image features into the language model through simple visual mapping networks or Q-formers in a one-time manner, but these methods do not consider the interaction between image information and human input queries. Therefore, static visual information may be insufficient for LLMs to generate responses that align with human intentions.

To solve this problem, the paper proposes LMEye, a human eye model with an interactive perception network. LMEye allows LLMs to request the necessary visual information based on different user instructions, thereby achieving dynamic visual information interaction. In this way, LLMs can better understand human queries, obtain corresponding visual information, and generate responses based on multimodal information.

### Main Contributions

1. **Interactive Perception Network**: Treating LLMs as multimodal information processors, an interactive perception network is proposed, enabling LLMs to request the necessary visual information based on different user queries. LLMs first understand the basic information of human queries and images, then send requests to obtain additional required visual information, and finally generate responses based on basic image information, text instructions, and interacted visual information. The entire training process is parameter-efficient.
2. **Superior Multimodal Understanding and Reasoning Performance**: LMEye performs excellently on two multimodal evaluation benchmarks (MMBench and SEED-Bench) with fewer parameters (4.4B), outperforming other multimodal large language models (MLLMs) (> 7B).
3. **Zero-Shot Performance Improvement**: Ablation experiments show that the proposed method significantly improves the performance of various scales and types of LLMs on zero-shot multimodal tasks. For example, on OK-VQA, LMEye (BLIP-2) improves by 5.0% compared to BLIP-2, and on long-answer VQA, LMEye (LLaMA-7b) improves by 20% compared to LLaVA (Vicuna-7b).

### Experimental Results

1. **MMBench Evaluation Results**: LMEye performs excellently across multiple dimensions, particularly in logical reasoning (LR), attribute reasoning (AR), and relational reasoning (RR), demonstrating its advantages in effective reasoning and connecting different pieces of information.
2. **SEED-Bench Evaluation Results**: LMEye improves scene understanding by 13 percentage points over previous state-of-the-art methods and also outperforms InstructBLIP in sample attribute recognition and spatial connection understanding, showcasing the effectiveness of the plug-in interactive perception framework.
3. **Zero-Shot Performance**: LMEye performs excellently on multiple common multimodal datasets, especially in visual question answering (VQA) and complex visual problem tasks (such as OK-VQA), significantly outperforming other methods.

### Ablation Experiments

1. **Visual Question Answering and Multimodal Reasoning**: Experimental results show that LMEye performs excellently in zero-shot visual question answering (VQA) and multimodal reasoning tasks, especially in answer choice (VCR) and short answer generation tasks (VQA), achieving good results even with only 1.7M images seen during the pre-training stage.
2. **Long-Answer VQA and Detailed Image Description**: LMEye significantly improves the performance of all generation metrics in long-answer VQA and detailed image description tasks, indicating that the multimodal instruction tuning method helps LLMs achieve image understanding capabilities similar to GPT-4.

Overall, LMEye significantly enhances the performance of LLMs in multimodal tasks by introducing an interactive perception network, particularly in dynamic visual information interaction, providing a new solution for multimodal information processing.
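
The difference between static and dynamic visual information can be illustrated with a small toy sketch. This is not the authors' implementation: the class name, the three weight matrices, the mean pooling, and the use of a single dot-product attention step as the "request-based interaction" are all assumptions chosen only to make the data flow concrete, and the weights are random rather than trained. The static path projects the image once, regardless of the query. The dynamic path takes a request vector (standing in for whatever the LLM emits after reading the query) and uses it to select image features before passing them back.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random (untrained) weights, scaled by 1/sqrt(cols)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(cols) for _ in range(cols)]
            for _ in range(rows)]


class InteractivePerceptionSketch:
    """Hypothetical stand-in for the four roles named in the abstract."""

    def __init__(self, d_img, d_llm, seed=0):
        rng = random.Random(seed)
        self.W_map = random_matrix(d_llm, d_img, rng)  # visual mapping network
        self.W_req = random_matrix(d_img, d_llm, rng)  # request acquisition
        self.W_out = random_matrix(d_llm, d_img, rng)  # transmission back to the LLM

    def static_visual_info(self, patches):
        # Pool the patch features and project once; the query plays no role.
        d = len(patches[0])
        pooled = [sum(p[i] for p in patches) / len(patches) for i in range(d)]
        return matvec(self.W_map, pooled)

    def dynamic_visual_info(self, patches, llm_request):
        # Map the LLM's request into image-feature space.
        query = matvec(self.W_req, llm_request)
        scale = math.sqrt(len(query))
        # Request-based interaction (assumed here to be one attention step):
        # weight each patch by its similarity to the request.
        weights = softmax([sum(q * v for q, v in zip(query, p)) / scale
                           for p in patches])
        attended = [sum(w * p[i] for w, p in zip(weights, patches))
                    for i in range(len(query))]
        # Project the interacted feature back into the LLM's input space.
        return matvec(self.W_out, attended), weights


# Toy data: 4 image patches of width 8, two different requests of width 16.
rng = random.Random(1)
patches = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
request_a = [rng.gauss(0, 1) for _ in range(16)]
request_b = [rng.gauss(0, 1) for _ in range(16)]

net = InteractivePerceptionSketch(d_img=8, d_llm=16)
static = net.static_visual_info(patches)
dyn_a, w_a = net.dynamic_visual_info(patches, request_a)
dyn_b, w_b = net.dynamic_visual_info(patches, request_b)
```

In this sketch `static` is identical for every query, while `dyn_a` and `dyn_b` differ because each request reweights the patches differently. That is the property the summary attributes to dynamic visual information interaction; the actual modules and training procedure are described in the paper itself.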

LMEye: An Interactive Perception Network for Large Language Models

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

NoteLLM-2: Multimodal Large Representation Models for Recommendation

LLMGA: Multimodal Large Language Model based Generation Assistant

MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

PerceptionGPT: Effectively Fusing Visual Perception into LLM

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Evaluating and Advancing Multimodal Large Language Models in Ability Lens

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

A Survey on Multimodal Large Language Models

A Survey on Evaluation of Multimodal Large Language Models

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality