Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Didi Zhu, Zhongyi Sun, Zexi Li, Tao Shen, Ke Yan, Shouhong Ding, Kun Kuang, Chao Wu
2024-02-19
Abstract: Catastrophic forgetting emerges as a critical challenge when fine-tuning multi-modal large language models (MLLMs), where improving performance on unseen tasks often leads to a significant performance drop on the original tasks. This paper presents a comprehensive analysis of catastrophic forgetting in MLLMs and introduces a post-training adjustment method called Model Tailor. Our method primarily preserves the pre-trained parameters while replacing a small number ($\leq$ 10\%) of fine-tuned parameters, maintaining $\sim$ 99\% effectiveness on original tasks versus pre-training, and achieving $\sim$ 97\% on new tasks compared to standard fine-tuning. Specifically, we derive a sparse mask to identify the "model patch", based on a fusion strategy that integrates salience and sensitivity analysis. Subsequently, a compensation mechanism is introduced to "decorate the patch", enhancing the model's performance on both target and original tasks. Additionally, our method is adaptable to multi-task scenarios. Through extensive experiments on InstructBLIP and LLaVA-1.5 in both image captioning and visual question answering tasks, our approach demonstrates significant task adaptability while preserving inherent pre-trained capabilities.
Computation and Language
What problem does this paper attempt to address?
This paper addresses catastrophic forgetting in multimodal large language models (MLLMs) during fine-tuning. Specifically, fine-tuning an MLLM to improve its performance on new tasks often causes a significant decline in its performance on the original tasks. The phenomenon is particularly prominent in multimodal models because they must process data from different modalities, which makes task generalization harder.

### Background of the Paper and Problem Definition

In recent years, large language models (LLMs) have significantly advanced artificial intelligence. Extending them with other modalities such as vision has produced multimodal large language models (MLLMs). However, these models perform poorly on unseen tasks. Traditional fine-tuning improves performance on new tasks but seriously damages performance on the original tasks, which is known as catastrophic forgetting.

### Research Motivation

The core research question of this paper is how to enhance the performance of MLLMs on target tasks without reducing their effectiveness on the original tasks. Current methods mainly target small models and rely on full-model fine-tuning, which is difficult to scale to large models because of its computational and storage costs. In addition, existing parameter-efficient methods such as Low-Rank Adaptation (LoRA) reduce the computational and memory burden but have limited effectiveness in alleviating catastrophic forgetting.

### Solution

To solve these problems, the paper proposes a parameter-efficient post-training adjustment method, **Model Tailor**. The method consists of two steps:

1. **Identify Model Patch**: Use a sparse mask to identify a critical subset of the fine-tuned parameters, namely those that matter most for target-task performance. The mask is derived from a fusion strategy over parameter change and loss change that integrates salience and sensitivity analysis.
2. **Decorate the Patch**: Compensate the selected critical parameters to mitigate the drop in target-task performance caused by discarding the remaining, non-critical fine-tuned parameters. The compensation uses the inverse of the Hessian matrix to adjust the weights precisely.

### Mathematical Representation

- **Fusion Function**:
  \[ \Theta_{\text{fusion}} = F(\Theta_{\text{sft}}, \Theta_{\text{pre}}) \]
  where \(\Theta_{\text{fusion}}\) represents the optimized fusion parameters, and \(\Theta_{\text{sft}}\) and \(\Theta_{\text{pre}}\) represent the fine-tuned parameters and the pre-trained parameters respectively.
- **Optimization Objective**:
  \[ \begin{aligned} &\text{minimize} \quad L_T(\Theta_{\text{fusion}}) - L_T(\Theta_{\text{sft}}) \leq \epsilon_t, \\ &\text{subject to} \quad L_P(\Theta_{\text{fusion}}) - L_P(\Theta_{\text{pre}}) \leq \epsilon_p, \end{aligned} \]
  where \(L_T\) and \(L_P\) represent the loss functions of the target task and the pre-training task respectively, and \(\epsilon_t\) and \(\epsilon_p\) represent the acceptable performance decline thresholds.

### Experimental Results

In experiments on the InstructBLIP and LLaVA-1.5 models, Model Tailor improves target-task performance while maintaining original-task performance. The paper reports that replacing at most 10% of the fine-tuned parameters retains about 99% effectiveness on the original tasks relative to the pre-trained model and about 97% on new tasks relative to standard fine-tuning. Experiments on multiple datasets show that Model Tailor is effective in the single-task scenario, and the method is also adaptable to multi-task scenarios.
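
The "Identify Model Patch" step described under Solution can be illustrated with a small pure-Python sketch. It is only a rough illustration under stated assumptions, not the paper's implementation: salience is assumed to be the absolute parameter change between the fine-tuned and pre-trained weights, the per-parameter `sensitivity` scores are assumed to be supplied by the caller, and the fusion is assumed to be a weighted sum of normalized scores. The names `select_patch`, `apply_patch`, `sparsity`, and `alpha` are hypothetical, and the Hessian-based compensation of the "Decorate the Patch" step is omitted.

```python
def select_patch(theta_pre, theta_sft, sensitivity, sparsity=0.10, alpha=0.5):
    """Return a 0/1 mask marking which fine-tuned parameters to keep.

    Assumptions (not from the paper): salience is |theta_sft - theta_pre|,
    `sensitivity` is a caller-supplied per-parameter score, and the two are
    fused as a weighted sum after max-normalization.
    """
    if not (len(theta_pre) == len(theta_sft) == len(sensitivity)):
        raise ValueError("all inputs must have the same length")
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")

    # Salience: how far fine-tuning moved each parameter.
    salience = [abs(s - p) for s, p in zip(theta_sft, theta_pre)]

    def normalize(scores):
        # Scale scores to [0, 1]; an all-zero list stays all zero.
        top = max(scores, default=0.0)
        return [x / top if top > 0 else 0.0 for x in scores]

    # Hypothetical fusion rule: weighted sum of the two normalized scores.
    fused = [
        alpha * a + (1.0 - alpha) * b
        for a, b in zip(normalize(salience), normalize(sensitivity))
    ]

    # Keep only the top `sparsity` fraction of parameters (the "patch").
    k = int(len(fused) * sparsity)
    keep = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]

    mask = [0] * len(fused)
    for i in keep:
        mask[i] = 1
    return mask


def apply_patch(theta_pre, theta_sft, mask):
    """Take fine-tuned values where mask is 1, pre-trained values elsewhere."""
    return [s if m else p for p, s, m in zip(theta_pre, theta_sft, mask)]
```

With ten parameters and `sparsity=0.10`, a single fine-tuned value is kept and the other nine revert to their pre-trained values, which mirrors the idea of preserving the pre-trained parameters and replacing only a small fraction of them. A real model would apply the same logic per weight tensor rather than to flat Python lists.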