Abstract: Instruction tuning is a prevalent technique for tailoring Large Vision-Language Models (LVLMs) to individual task requirements. Most existing approaches are confined to single-task adaptation, whereas real-world requirements are inherently varied and continually evolving. An ideal LVLM should therefore sustain continual instruction tuning over a stream of task distributions (i.e., different domains, emerging capabilities, and new datasets) while minimizing the forgetting of previously acquired knowledge. To this end, we propose a new benchmark for COntinuAl inStruction Tuning on LVLMs (COAST), which encompasses domain-incremental, capability-incremental, and dataset-incremental configurations. In terms of methodology, we propose Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning in LVLMs. To avoid the additional overhead of experience replay, we freeze the LVLM and construct dual increment embeddings for each input instruction to enable parameter-efficient tuning. Specifically, the increment embeddings are decomposed into two principal components: 1) intrinsic increment embeddings, which encode task-specific characteristics. For these, we set up a low-rank pool of candidate embeddings, from which we select the relevant ones based on their similarity to the user instruction; 2) contextual increment embeddings, which capture the inter-dependencies across tasks. For these, the low-rank embeddings chosen in previous tasks are aggregated via a learnable weighted sum to provide complementary hints. Extensive experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing forgetting during the continual instruction tuning process.
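
The two increment components described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of cosine similarity against per-entry key vectors, the top-k selection, and the softmax normalization of the aggregation weights are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import numpy as np

def cosine_similarity(query, keys):
    """Cosine similarity between one query vector and each row of keys."""
    query = query / (np.linalg.norm(query) + 1e-8)
    keys = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    return keys @ query

def select_from_pool(instruction_emb, pool_keys, top_k):
    """Indices of the top_k pool entries most similar to the instruction.

    Assumption: each pool entry carries a key vector in the same space
    as the instruction embedding.
    """
    sims = cosine_similarity(instruction_emb, pool_keys)
    return np.argsort(-sims, kind="stable")[:top_k]

def intrinsic_increment(pool_A, pool_B, selected):
    """Sum of the selected low-rank products A_i @ B_i (each of shape d x d)."""
    return sum(pool_A[i] @ pool_B[i] for i in selected)

def contextual_increment(pool_A, pool_B, previous_selected, logits):
    """Weighted sum of low-rank products chosen in previous tasks.

    Assumption: the learnable weights are stored as logits and
    normalized with a softmax; in training they would be optimized.
    """
    weights = np.exp(logits - np.max(logits))
    weights = weights / weights.sum()
    return sum(w * (pool_A[i] @ pool_B[i])
               for w, i in zip(weights, previous_selected))

# Toy setup: 5 pool entries, hidden size 4, rank 2 (all values hypothetical).
rng = np.random.default_rng(0)
n, d, r = 5, 4, 2
pool_keys = np.eye(n, d)                 # one key vector per pool entry
pool_A = rng.normal(size=(n, d, r))      # low-rank factors A_i
pool_B = rng.normal(size=(n, r, d))      # low-rank factors B_i

instruction_emb = np.array([0.0, 0.0, 1.0, 0.0])
selected = select_from_pool(instruction_emb, pool_keys, top_k=2)
intrinsic = intrinsic_increment(pool_A, pool_B, selected)

previous_selected = [0, 1]               # entries picked in earlier tasks
context_logits = np.zeros(len(previous_selected))
contextual = contextual_increment(pool_A, pool_B, previous_selected, context_logits)

# The frozen model's weights stay untouched; only the increment is added.
total_increment = intrinsic + contextual
```

In this sketch the backbone is never modified, which mirrors the rehearsal-free, parameter-efficient setting the abstract describes: only the pool entries and the aggregation weights would be trainable.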

Continual Instruction Tuning for Large Multimodal Models

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models

CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model

Separable Mixture of Low-Rank Adaptation for Continual Visual Instruction Tuning

CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models

SwitchCIT: Switching for Continual Instruction Tuning of Large Language Models

LLaCA: Multimodal Large Language Continual Assistant

Aligning Large Multi-Modal Model with Robust Instruction Tuning

Error-driven Data-efficient Large Multimodal Model Tuning

MLAN: Language-Based Instruction Tuning Improves Zero-Shot Generalization of Multimodal Large Language Models

Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer

Towards Robust Instruction Tuning on Multimodal Large Language Models

From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning

Balancing Continuous Pre-Training and Instruction Fine-Tuning: Optimizing Instruction-Following in LLMs

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

Vision-Language Instruction Tuning: A Review and Analysis

CoTBal: Comprehensive Task Balancing for Multi-Task Visual Instruction Tuning

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Fine-tuning Large Language Models with Sequential Instructions

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity