Abstract: Instruction tuning in multimodal large language models (MLLMs) aims to smoothly integrate a backbone LLM with a pre-trained feature encoder for downstream tasks. The major challenge is to efficiently find the synergy through cooperative learning, in which the LLM adapts its reasoning abilities to downstream tasks while the feature encoder adjusts its encoding to provide more relevant modal information. In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives and find that unbalanced learning between the two components, i.e., the feature encoder and the LLM, can cause diminishing learning gradients that slow model convergence and often lead to sub-optimal results due to insufficient learning. Inspired by these findings, we propose a measurement that quantitatively evaluates the learning balance, and based on it we design a dynamic learning scheduler that better coordinates the learning. In addition, we introduce an auxiliary loss regularization method that promotes updating of the generation distribution of MLLMs while considering the learning state of each model component, which potentially prevents each component from gradient diminishing and enables a more accurate estimation of the learning balance coefficient. We conduct experiments with multiple LLM backbones and feature encoders; our techniques are model-agnostic and can be generically integrated with various MLLM backbones. Experimental results on multiple downstream tasks and modalities in vision and audio demonstrate the proposed method's better efficiency and effectiveness in MLLM instruction tuning.

Multimodal Instruction Tuning with Conditional Mixture of LoRA

Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models

MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning

CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models

MultiLoRA: Democratizing LoRA for Better Multi-Task Learning

A Framework to Implement 1+N Multi-task Fine-tuning Pattern in LLMs Using the CGC-LORA Algorithm

Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning

MTL-LoRA: Low-Rank Adaptation for Multi-Task Learning

Demystifying Instruction Mixing for Fine-tuning Large Language Models

MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning

MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning

MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts

Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

MoR: Mixture of Ranks for Low-Rank Adaptation Tuning

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

M^2PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning

Chain-of-LoRA: Enhancing the Instruction Fine-Tuning Performance of Low-Rank Adaptation on Diverse Instruction Set

Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations

MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models