Abstract: Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language Models (LVLMs) to meet individual task requirements. To date, most of the existing approaches are confined to single-task adaptation, whereas the requirements in real-world scenarios are inherently varied and continually evolving. Thus, an ideal LVLM should sustain continual instruction tuning in the face of stream-task distributions (i.e., different domains, emerging capabilities, and new datasets) while minimizing the forgetting of previously acquired knowledge. To achieve this, we propose a new benchmark for COntinuAl inStruction Tuning on LVLMs (COAST), which encompasses the aforementioned domain-incremental, capability-incremental, and dataset-incremental configurations. In terms of methodology, we propose Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning in LVLMs. To circumvent the additional overhead associated with experience replay, we freeze LVLMs and construct the dual increment embeddings for each input instruction to facilitate parameter-efficient tuning. Specifically, the increment embeddings can be decomposed into two principal components: 1) intrinsic increment embeddings to encode task-specific characteristics. To achieve this, we set up a low-rank pool containing candidate embeddings, from which we select the relevant ones based on their similarity with the user instructions; 2) contextual increment embeddings to investigate the inter-dependencies across tasks. In this regard, the low-rank embeddings chosen in the previous tasks are aggregated via a learnable weighted sum to provide complementary hints. Extensive experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing the forgetting during the continual instruction tuning process.

What problem does this paper attempt to address?

The main problem this paper addresses is continual instruction tuning for large vision-language models (LVLMs): adapting to diverse, ever-changing task requirements while minimizing the forgetting of previously learned knowledge. Most existing approaches are limited to single-task adaptation, whereas real-world task requirements vary and evolve continuously. An ideal LVLM should therefore sustain instruction tuning over streaming task distributions (such as different domains, emerging capabilities, and new datasets) while retaining what it has already learned.

To solve this problem, the authors make the following contributions:

1. **COAST Benchmark**: By collecting and reusing existing benchmarks, the authors create a new COAST (COntinuAl inStruction Tuning) benchmark covering three continual learning settings: domain-incremental, capability-incremental, and dataset-incremental.
2. **Continual LLaVA Model**: A new method named Continual LLaVA is proposed. It freezes the pre-trained LVLM and constructs dual increment embeddings to achieve parameter-efficient continual instruction tuning. Specifically:
   - **Intrinsic Increment Embeddings**: These encode task-specific features and are obtained by selecting, from a low-rank pool, the candidate embeddings most similar to the user instruction.
   - **Contextual Increment Embeddings**: These capture the dependencies between tasks and provide supplementary information through a learnable weighted aggregation of the low-rank embeddings selected in previous tasks.
3. **Experimental Results**: Continual LLaVA outperforms other methods on the COAST benchmark, with notable improvements in average accuracy and average forgetting rate. For example, on COAST-domain, the average accuracy of Continual LLaVA is 13.06% higher than that of sequential training, and the average forgetting rate is also significantly reduced.

### Markdown Representation of Formulas

1. **Low-Rank Decomposition Formula**:
   \[ P_n = A_n \cdot B_n \]
   where \(P_n\in\mathbb{R}^{D\times D}\) is the product of the learnable matrices \(A_n\in\mathbb{R}^{D\times R}\) and \(B_n\in\mathbb{R}^{R\times D}\), with \(R\ll D\).
2. **Selecting Intrinsic Increment Embeddings**:
   \[ I = \operatorname*{arg\,top}_{n\in[1, N]} \cos(k_n, q_i^t) \]
   where \(I = \{i_1, \dots, i_M\}\) is the selected index set and \(\cos(\cdot,\cdot)\) denotes cosine similarity.
3. **Generating Intrinsic Increment Embeddings**:
   \[ \Delta\theta_i^t = \frac{\sum_{m=1}^{M}\cos(q_i^t, k_{i_m})\cdot P_{i_m}}{\sum_{m=1}^{M}\cos(q_i^t, k_{i_m})} \]
4. **Generating Contextual Increment Embeddings**:
   \[ \Delta\delta_i^t = \sum_{l=1}^{t} w_l\,\text{sg}(Z_l) \]
   where \(w_l\in[0, 1]\) is a learnable weight and \(Z_l\) is the instance-average pooling result of all selected increment embeddings in the \(l\)-th task.
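The four formulas above can be traced numerically with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `build_pool`, `select_indices`, `intrinsic_embedding`, `contextual_embedding`) are hypothetical, \(k_n\) is treated as a key vector attached to each pool entry and \(q_i^t\) as a query vector derived from the instruction, the learnable quantities (\(A_n\), \(B_n\), \(w_l\)) are fixed toy values rather than trained parameters, and \(\text{sg}(\cdot)\) is read as a stop-gradient, which has no effect in NumPy because nothing is differentiated.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def build_pool(num_entries, dim, rank, rng):
    """Formula 1: every pool entry is P_n = A_n @ B_n, so its rank is at most R."""
    A = rng.normal(size=(num_entries, dim, rank))
    B = rng.normal(size=(num_entries, rank, dim))
    return A @ B  # batched matmul, shape (N, D, D)


def select_indices(keys, query, top_m):
    """Formula 2: indices of the M keys most similar to the query."""
    sims = np.array([cosine(k, query) for k in keys])
    idx = np.argsort(-sims)[:top_m]  # sort by descending similarity
    return idx, sims[idx]


def intrinsic_embedding(pool, idx, sims):
    """Formula 3: similarity-weighted average of the selected pool entries.

    Assumes the selected similarities have a positive sum.
    """
    weights = sims / sims.sum()
    return np.tensordot(weights, pool[idx], axes=1)  # shape (D, D)


def contextual_embedding(task_summaries, weights):
    """Formula 4: weighted sum of the per-task summaries Z_l.

    sg(.) is a no-op here because NumPy does not track gradients.
    """
    return np.tensordot(np.asarray(weights), np.stack(task_summaries), axes=1)


# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
N, D, R, M = 4, 6, 2, 2
pool = build_pool(N, D, R, rng)

keys = np.eye(N)                         # one key per pool entry
query = np.array([3.0, 4.0, 0.0, 0.0])   # stands in for q_i^t

idx, sims = select_indices(keys, query, M)
delta_theta = intrinsic_embedding(pool, idx, sims)

# Stand-ins for Z_1 and Z_2: averages of the entries "selected" in two tasks.
Z = [pool[[0, 1]].mean(axis=0), pool[[2, 3]].mean(axis=0)]
delta_delta = contextual_embedding(Z, [0.25, 0.75])
```

With these toy keys, the query has cosine similarity 0.6 with the first key and 0.8 with the second, so entries 1 and 0 are selected and mixed with weights 4/7 and 3/7. How the two increment embeddings are injected into the frozen LVLM is not described in the summary above, so the sketch stops at computing them.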

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models
