Investigating the Catastrophic Forgetting in Multimodal Large Language Models

Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma

2023-12-05

Abstract: Following the success of GPT4, there has been a surge in interest in multimodal large language model (MLLM) research. This line of research focuses on developing general-purpose LLMs through fine-tuning pre-trained LLMs and vision models. However, catastrophic forgetting, a notorious phenomenon where the fine-tuned model fails to retain similar performance compared to the pre-trained model, still remains an inherent problem in MLLMs. In this paper, we introduce EMT: Evaluating MulTimodality for evaluating the catastrophic forgetting in MLLMs, by treating each MLLM as an image classifier. We first apply EMT to evaluate several open-source fine-tuned MLLMs and we discover that almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks. Moreover, we continue fine-tuning LLaVA, an MLLM, and utilize EMT to assess performance throughout the fine-tuning. Interestingly, our results suggest that early-stage fine-tuning on an image dataset improves performance across other image datasets, by enhancing the alignment of text and visual features. However, as fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability, even when the image encoder remains frozen. Our results suggest that MLLMs have yet to demonstrate performance on par with their vision models on standard image classification tasks and the current MLLM fine-tuning procedure still has room for improvement.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper studies catastrophic forgetting in the fine-tuning of multimodal large language models (MLLMs). When these models are fine-tuned to adapt to specific tasks, they may forget knowledge they learned earlier; in particular, their performance on image classification tasks can decline significantly. To study this phenomenon, the authors propose an evaluation framework named EMT (Evaluating MulTimodality), which treats each MLLM as an image classifier and measures how its classification performance changes after fine-tuning. The main contributions of the paper are:

1. **Proposing the EMT framework**: A framework designed specifically to evaluate catastrophic forgetting in MLLMs, which measures changes in model performance through image classification tasks.
2. **Experimental verification**: Experiments on multiple open-source MLLMs show that almost all of the tested models fail to retain the classification performance of their pre-trained vision encoders after fine-tuning.
3. **Fine-tuning experiments**: Continued fine-tuning of LLaVA shows that early-stage fine-tuning on one image dataset improves performance on other image datasets, whereas continued fine-tuning leads to catastrophic forgetting: the model begins to hallucinate and loses generalizability, even when the image encoder remains frozen.

Overall, the paper identifies a key problem in the current MLLM fine-tuning procedure, provides a framework for evaluating it, and concludes that the procedure still has room for improvement.
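The idea of treating an MLLM as an image classifier can be illustrated with a minimal sketch. It assumes a hypothetical `generate(image, prompt)` callable that wraps whatever MLLM is being tested and returns its text answer. The prompt wording and the substring-based matching rule are illustrative assumptions, not the procedure the paper itself uses.

```python
from typing import Any, Callable, Iterable, Optional, Sequence, Tuple


def build_prompt(class_names: Sequence[str]) -> str:
    """Build a classification-style prompt (hypothetical wording)."""
    return (
        "What is the object in the image? "
        "Answer with exactly one of: " + ", ".join(class_names) + "."
    )


def match_label(response: str, class_names: Sequence[str]) -> Optional[str]:
    """Return the class name mentioned in the response, or None.

    Assumption: an answer counts only if exactly one class name appears
    in the text. Simple substring matching can misfire when one class
    name is contained in another word, so this is only a rough rule.
    """
    text = response.lower()
    hits = [name for name in class_names if name.lower() in text]
    return hits[0] if len(hits) == 1 else None


def classification_accuracy(
    generate: Callable[[Any, str], str],
    samples: Iterable[Tuple[Any, str]],
    class_names: Sequence[str],
) -> float:
    """Fraction of (image, label) samples the model labels correctly."""
    prompt = build_prompt(class_names)
    correct = 0
    total = 0
    for image, label in samples:
        predicted = match_label(generate(image, prompt), class_names)
        correct += int(predicted == label)
        total += 1
    return correct / total if total else 0.0


# Usage with a stub standing in for a real MLLM (hypothetical responses).
def stub_model(image: Any, prompt: str) -> str:
    canned = {
        "img0": "The object is a cat.",
        "img1": "I see a dog and a cat.",  # ambiguous, counted as wrong
        "img2": "a truck",
    }
    return canned[image]


samples = [("img0", "cat"), ("img1", "dog"), ("img2", "truck")]
accuracy = classification_accuracy(stub_model, samples, ["cat", "dog", "truck"])
```

Running the same function on a model before and after fine-tuning, on datasets other than the one used for fine-tuning, gives a simple before/after comparison of the kind the summary describes.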

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Examining Forgetting in Continual Pre-training of Aligned Large Language Models

Exploring Forgetting in Large Language Model Pre-Training

Revisiting Catastrophic Forgetting in Large Language Model Tuning

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models

Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models

Unified Generative and Discriminative Training for Multi-modal Large Language Models

Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model