Abstract: Fine-tuning multimodal large language models (MLLMs) presents significant challenges, including a reliance on high-level visual features that limits fine-grained detail comprehension, and data conflicts that arise from task complexity. To address these issues, we propose an efficient fine-tuning framework with two novel approaches: Vision Cue Enhancement (VCE) and Dual Low-Rank Adaptation (Dual-LoRA). VCE enhances the vision projector by integrating multi-level visual cues, improving the model's ability to capture fine-grained visual features. Dual-LoRA introduces a dual low-rank structure for instruction tuning, decoupling learning into skill and task spaces to enable precise control and efficient adaptation across diverse tasks. Our method simplifies implementation, enhances visual comprehension, and improves adaptability. Experiments on both downstream tasks and general benchmarks demonstrate the effectiveness of our proposed approach.

What problem does this paper attempt to address?

This paper addresses two main problems that multimodal large language models (MLLMs) face during fine-tuning:

1. **Insufficient Visual Feature Representation**: During pre-training of the visual projector, existing methods rely mainly on high-level visual features and often overlook low-level, fine-grained details, which limits the model's ability to understand visual information. Most works use only high-level semantic feature maps for visual token projection; some utilize multi-layer visual features, but they still lack more detailed information.
2. **Data Conflict**: During instruction fine-tuning, as downstream tasks grow more diverse and complex, data conflicts in LoRA-based instruction fine-tuning become increasingly prominent. To alleviate this, recent studies have introduced the Mixture of Experts (MoE) paradigm into the LoRA module, embedding multiple LoRA units in one linear layer to leverage the specific strengths of each. However, such a design increases implementation complexity and may also prolong training and inference time.

To solve these problems, the authors propose two new methods:

- **Visual Cue Enhancement (VCE)**: Enhances the visual projector by integrating multi-level visual cues, improving the model's ability to capture fine-grained visual features.
- **Dual Low-Rank Adaptation (Dual-LoRA)**: Introduces a dual low-rank structure that decouples learning into a skill low-rank space and a task-activation low-rank space, enabling precise control and efficient adaptation to multiple tasks.

According to the authors, these methods simplify implementation, enhance visual understanding, and improve adaptability. Experimental results show that the proposed methods perform well on both downstream tasks and general benchmarks.
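To make the dual low-rank idea more concrete, here is a minimal pure-Python sketch of a linear layer whose frozen weight receives one low-rank "skill" update that is modulated by a second low-rank "task activation" term. This is an illustration under stated assumptions, not the authors' implementation: the summary above does not say how the two spaces are combined, so the element-wise ReLU gate, the zero initialization of `skill_b`, the `alpha / rank` scaling, and every name (`DualLoRALinear`, `skill_a`, `task_b`, and so on) are hypothetical choices made for this example.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.1) -> Matrix:
    """Small Gaussian-initialized matrix."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


class DualLoRALinear:
    """Frozen weight plus a gated low-rank update (illustrative only)."""

    def __init__(self, d_in: int, d_out: int, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        self.weight = rand_matrix(d_out, d_in, rng)  # frozen pre-trained weight
        # Skill space: skill_b @ skill_a. skill_b starts at zero (as in standard
        # LoRA) so the layer initially matches the frozen model.
        self.skill_a = rand_matrix(rank, d_in, rng)
        self.skill_b = zeros(d_out, rank)
        # Task-activation space: a second low-rank pair used only as a gate.
        self.task_a = rand_matrix(rank, d_in, rng)
        self.task_b = rand_matrix(d_out, rank, rng)
        self.scale = alpha / rank

    def delta(self) -> Matrix:
        """Weight update: skill term gated element-wise by ReLU(task term).

        ASSUMPTION: the ReLU gate and element-wise product are choices made
        for this sketch; the source summary does not specify the combination.
        """
        skill = matmul(self.skill_b, self.skill_a)
        task = matmul(self.task_b, self.task_a)
        return [
            [self.scale * s * max(t, 0.0) for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(skill, task)
        ]

    def forward(self, x: List[float]) -> List[float]:
        """Apply (weight + delta) to a single input vector x of length d_in."""
        delta = self.delta()
        return [
            sum((w + d) * xi for w, d, xi in zip(w_row, d_row, x))
            for w_row, d_row in zip(self.weight, delta)
        ]


if __name__ == "__main__":
    layer = DualLoRALinear(d_in=4, d_out=3, rank=2)
    print(layer.forward([1.0, 2.0, 3.0, 4.0]))
```

Unlike the MoE-style designs mentioned above, this sketch has no router and no set of expert modules: a single linear layer carries two low-rank pairs, and the gate decides which entries of the skill update are active. Whether this matches the paper's exact formulation should be checked against the paper itself.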

Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning
