Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
2024-11-19
Abstract: Fine-tuning multimodal large language models (MLLMs) presents significant challenges, including a reliance on high-level visual features that limits fine-grained detail comprehension, and data conflicts that arise from task complexity. To address these issues, we propose an efficient fine-tuning framework with two novel approaches: Vision Cue Enhancement (VCE) and Dual Low-Rank Adaptation (Dual-LoRA). VCE enhances the vision projector by integrating multi-level visual cues, improving the model's ability to capture fine-grained visual features. Dual-LoRA introduces a dual low-rank structure for instruction tuning, decoupling learning into skill and task spaces to enable precise control and efficient adaptation across diverse tasks. Our method simplifies implementation, enhances visual comprehension, and improves adaptability. Experiments on both downstream tasks and general benchmarks demonstrate the effectiveness of our proposed approach.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses two main problems that arise when fine-tuning multimodal large language models (MLLMs):

1. **Insufficient Visual Feature Representation**: When the visual projector is pre-trained, existing methods rely mainly on high-level visual features and often overlook low-level, fine-grained details, which limits the model's ability to understand visual information. Most works use only high-level semantic feature maps for visual token projection, and the works that do use multi-layer visual features still miss finer detail.
2. **Data Conflict**: During instruction fine-tuning, data conflicts in LoRA-based tuning become more pronounced as downstream tasks grow more diverse and complex. Recent studies try to alleviate this by bringing the Mixture of Experts (MoE) paradigm into the LoRA module, embedding multiple LoRA units in one linear layer so that each can contribute its own strengths. Such a design increases implementation complexity and may also prolong training and inference time.

To solve these problems, the authors propose two new methods:

- **Visual Cue Enhancement (VCE)**: Enhances the visual projector by integrating multi-level visual cues, improving the model's ability to capture fine-grained visual features.
- **Dual Low-Rank Adaptation (Dual-LoRA)**: Introduces a dual low-rank structure that decouples learning into a skill low-rank space and a task-activation low-rank space, giving precise control and efficient adaptation across multiple tasks.

These methods simplify implementation, enhance visual understanding, and improve adaptability. Experimental results show that the proposed methods perform well on both downstream tasks and general benchmarks.
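
To make the Dual-LoRA idea above more concrete, here is a minimal NumPy sketch of a linear layer with two low-rank branches. It is an illustration under stated assumptions, not the authors' implementation: the summary only says that learning is decoupled into a skill space and a task-activation space, so the choice to let a sigmoid of the task branch gate the skill branch elementwise is an assumption, as are the class name `DualLoRALinear`, the parameter names, and the initialization scheme.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


class DualLoRALinear:
    """Frozen linear layer plus a gated low-rank update (illustrative only).

    Assumption: the task branch modulates the skill branch through an
    elementwise sigmoid gate. The source does not specify the exact form.
    """

    def __init__(self, d_in, d_out, rank=4, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # Pre-trained weight; kept frozen during fine-tuning.
        self.W = rng.normal(0.0, 0.02, size=(d_out, d_in))
        # Skill branch: B_skill starts at zero, so the update is zero at init
        # (the usual LoRA convention).
        self.A_skill = rng.normal(0.0, 0.02, size=(rank, d_in))
        self.B_skill = np.zeros((d_out, rank))
        # Task-activation branch: produces the gate applied to the skill update.
        self.A_task = rng.normal(0.0, 0.02, size=(rank, d_in))
        self.B_task = rng.normal(0.0, 0.02, size=(d_out, rank))
        self.scale = scale

    def delta(self):
        """Weight update: skill matrix gated by the activated task matrix."""
        skill = self.B_skill @ self.A_skill          # (d_out, d_in), rank <= r
        gate = sigmoid(self.B_task @ self.A_task)    # (d_out, d_in), values in (0, 1)
        return self.scale * skill * gate

    def __call__(self, x):
        """Apply the adapted layer to a batch x of shape (batch, d_in)."""
        return x @ (self.W + self.delta()).T

    def num_trainable(self):
        """Count parameters in the four low-rank factors."""
        factors = (self.A_skill, self.B_skill, self.A_task, self.B_task)
        return sum(p.size for p in factors)


layer = DualLoRALinear(d_in=64, d_out=32, rank=4)
x = np.random.default_rng(1).normal(size=(2, 64))
y = layer(x)  # shape (2, 32); equals the frozen layer's output at initialization
```

In this sketch a single pair of branches sits in each linear layer, in contrast to the MoE-style designs mentioned above that embed several LoRA units plus routing. The four factors together hold far fewer parameters than the frozen weight matrix they adapt.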