Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, Yang Liu
2023-09-14
Abstract: Recently, Multimodal Large Language Models (MLLMs) that enable Large Language Models (LLMs) to interpret images through visual instruction tuning have achieved significant success. However, existing visual instruction tuning methods only utilize image-language instruction data to align the language and image modalities, lacking a more fine-grained cross-modal alignment. In this paper, we propose Position-enhanced Visual Instruction Tuning (PVIT), which extends the functionality of MLLMs by integrating an additional region-level vision encoder. This integration promotes a more detailed comprehension of images for the MLLM. In addition, to efficiently achieve a fine-grained alignment between the vision modules and the LLM, we design multiple data generation strategies to construct an image-region-language instruction dataset. Finally, we present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model. Code and data will be released at https://github.com/PVIT-official/PVIT.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to address the shortcomings of existing Multimodal Large Language Models (MLLMs) in fine-grained image understanding. Specifically, current visual instruction tuning methods only use image-language instruction data to align language and image modalities, lacking more detailed cross-modal alignment capabilities. This results in limited ability of the models to recognize specific objects in complex scenes and difficulty in handling fine-grained instructions containing spatial information (e.g., "What is the object in this area?"). Therefore, the paper proposes a Position-enhanced Visual Instruction Tuning (PVIT) method, which integrates a region-level visual encoder to enhance the fine-grained image understanding and interaction capabilities of MLLMs.

### Main Contributions

1. **Proposing Position-enhanced Visual Instruction Tuning (PVIT)**: This method introduces a region-level visual encoder to extend the fine-grained understanding and interaction capabilities of MLLMs.
2. **Designing a Region-level Instruction Data Generation Scheme**: Various methods are proposed to generate region-level instruction data, and a new evaluation dataset, FineEval, is constructed specifically to assess the performance of MLLMs in following instructions that require fine-grained spatial details.
3. **Extensive Experimental Validation**: Quantitative experiments and qualitative analyses demonstrate the effectiveness of the proposed method.

### Method Overview

1. **Model Design**:
   - **Visual Encoder**: Used to process input images.
   - **Region Encoder**: Extracts region features from RegionCLIP.
   - **Large Language Model (LLM)**: Combines image, instruction, and region features to generate responses.
2. **Training Process**:
   - **First Stage**: Initialize the model, freeze the parameters of the image encoder, region encoder, and LLM, and only train the linear projection layer to align region features to the LLM's embedding space.
   - **Second Stage**: Fine-tune using region-level instruction data to further enhance the model's ability to handle instructions containing regions.
3. **Region-level Instruction Data Generation**:
   - **Dataset Conversion**: Convert existing VQA datasets to region-level instruction format.
   - **Task-specific Instruction Data Generation**: Use ChatGPT to generate region-level instruction data for specific tasks.
   - **General Instruction Data Generation**: Generate more general region-level instruction data through detailed image descriptions and automatic annotation.

### Experimental Results

1. **Object Recognition**: On the MS COCO dataset, PVIT significantly outperforms baseline models LLaVA and Shikra, and is comparable to GPT4RoI.
2. **Multimodal Reasoning**: On the GQA dataset, PVIT shows the highest performance, especially in handling instructions requiring fine-grained spatial information.
3. **Human Evaluation**: On the FineEval dataset, PVIT excels in multiple aspects (such as object recognition, attribute description, reasoning, etc.), particularly in handling complex relationships and detailed spatial information.

### Conclusion

By introducing a region-level visual encoder and designing a region-level instruction data generation scheme, PVIT significantly enhances the fine-grained image understanding and interaction capabilities of MLLMs. This method provides a new direction for future research on multimodal large language models.
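
The "Dataset Conversion" step in the Method Overview, which turns existing VQA data into region-level instructions, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the PVIT codebase: the `VQASample` schema, the `<region>` placeholder token, the normalized-box output, and the function names are all hypothetical. The idea shown is only that an object phrase in a question is replaced by a region placeholder, with the object's bounding box carried alongside the instruction.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max); pixels on input, fractions of image size on output
Box = Tuple[float, float, float, float]


@dataclass
class VQASample:
    """Hypothetical VQA record whose question mentions an annotated object."""
    image_id: str
    image_size: Tuple[int, int]  # (width, height) in pixels
    question: str                # must contain object_phrase
    object_phrase: str           # text span that refers to the boxed object
    box: Box                     # bounding box of that object, in pixels
    answer: str


def normalize_box(box: Box, image_size: Tuple[int, int]) -> Box:
    """Scale pixel coordinates to [0, 1] so they do not depend on resolution."""
    width, height = image_size
    x0, y0, x1, y1 = box
    return (round(x0 / width, 3), round(y0 / height, 3),
            round(x1 / width, 3), round(y1 / height, 3))


def to_region_instruction(sample: VQASample,
                          region_token: str = "<region>") -> Dict[str, object]:
    """Replace the object phrase with a region placeholder (assumed format)."""
    if sample.object_phrase not in sample.question:
        raise ValueError("object_phrase does not occur in the question")
    # Replace only the first occurrence so one box maps to one placeholder.
    instruction = sample.question.replace(sample.object_phrase, region_token, 1)
    regions: List[Box] = [normalize_box(sample.box, sample.image_size)]
    return {
        "image_id": sample.image_id,
        "instruction": instruction,
        "regions": regions,
        "response": sample.answer,
    }


if __name__ == "__main__":
    sample = VQASample(
        image_id="example_0001",  # made-up identifier
        image_size=(640, 480),
        question="What color is the dog?",
        object_phrase="the dog",
        box=(64, 48, 320, 240),
        answer="brown",
    )
    print(to_region_instruction(sample)["instruction"])  # What color is <region>?
```

In this sketch the region features themselves are not computed. A real pipeline would pass each box in `regions` to the region encoder and substitute the projected feature at the position of the placeholder token.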