Abstract: Existing work on catastrophic forgetting, which arises when old categories are no longer visible under sequential input, has made progress mainly on relatively simple classification tasks. Video captioning, by contrast, is a more complex multimodal task that has not yet been explored in incremental learning. Having identified this stability-plasticity problem when analyzing video with sequential input, we propose, for the first time, a method to Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning (MCF-VC). To maintain good performance on old tasks at the macro level, we design Fine-grained Sensitivity Selection (FgSS), which uses a mask over the parameters of linear layers together with Fisher sensitivity to pick useful knowledge from old tasks. To better constrain the knowledge of old and new tasks at the feature level, we introduce Two-stage Knowledge Distillation (TsKD), which learns the new task well while still accounting for the old one. Specifically, we design two distillation losses, which constrain the cross-modal semantic information of the semantic attention feature map and the textual information of the final outputs, respectively, so that the inter-model and intra-model stylized knowledge of the old classes is retained while the new classes are learned. To measure how well the model resists forgetting, we design a metric, CIDER_t, that captures the forgetting rate at each stage. Experiments on the public dataset MSR-VTT show that the proposed method significantly resists forgetting of previous tasks without replaying old samples, and performs well on the new task.
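
The selection idea behind FgSS can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not give the exact selection rule, so the code assumes a diagonal empirical Fisher estimate (the mean of squared per-sample gradients) as the sensitivity score and a top-fraction mask over the parameters of a single linear layer. The function names `fisher_sensitivity`, `top_fraction_mask`, and `merge_parameters`, and the `keep_fraction` parameter, are hypothetical.

```python
def fisher_sensitivity(per_sample_grads):
    """Diagonal empirical Fisher estimate: mean of squared per-sample gradients.

    Assumption: this stands in for the "Fisher sensitivity" named in the abstract.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def top_fraction_mask(scores, keep_fraction):
    """Binary mask marking the highest-scoring fraction of parameters."""
    k = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def merge_parameters(old_params, new_params, mask):
    """Keep the old value where the mask is 1, otherwise take the new value."""
    return [o if m else n for o, n, m in zip(old_params, new_params, mask)]


# Toy example: two samples, four flattened linear-layer parameters.
grads = [[0.1, -2.0, 0.0, 0.5],
         [0.3, 1.0, 0.0, -0.5]]
scores = fisher_sensitivity(grads)
mask = top_fraction_mask(scores, keep_fraction=0.5)
merged = merge_parameters([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], mask)
print(mask)    # [0, 1, 0, 1]
print(merged)  # [10.0, 2.0, 30.0, 4.0]
```

In this toy case the second and fourth parameters have the largest squared gradients on the old task, so their old values are preserved while the remaining parameters take the newly trained values.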

What problem does this paper attempt to address?

This paper addresses catastrophic forgetting caused by sequential input in the multimodal video captioning task. When a model is trained on new categories without access to the old data, it may forget the knowledge it learned before. The phenomenon is particularly prominent in video captioning, which fuses the visual and text modalities and is therefore more complex than simple classification. The paper proposes a method to alleviate catastrophic forgetting in class-incremental learning, namely MCF-VC (Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning). The main contributions of the paper are:
1. It explores, for the first time, the incremental learning problem in video captioning and proposes a method built on a video captioning backbone to deal with old-task forgetting and poor new-task performance caused by sequential input.
2. Given the particular demands of incremental video captioning, it modifies the backbone network to make it better suited to receiving sequentially arriving subtasks.
3. It designs a selector that adaptively picks valuable information from fine-grained old training parameters to improve the performance of the target model.
4. For the outputs of different stages of the backbone network, it proposes two different distillation methods that constrain the representation of cross-modal and text information, so that the model can balance old and new knowledge.
5. Experiments on the public dataset MSR-VTT show that the proposed incremental method significantly improves performance on both the metrics for class-incremental video captioning forgetting and the standard natural language processing metrics.
Through these contributions, the paper provides an effective solution that can significantly resist the forgetting of previous tasks without replaying old samples and perform well on new tasks.
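
The two distillation constraints described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: the feature-level term is assumed to be a mean squared error between the old and new models' attention features, the output-level term is assumed to be a temperature-softened KL divergence between their output distributions, and the weights `alpha` and `beta` are hypothetical. All function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def feature_distillation(old_features, new_features):
    """Stage 1 (assumed form): MSE between old and new attention features."""
    diffs = [(o - n) ** 2 for o, n in zip(old_features, new_features)]
    return sum(diffs) / len(diffs)


def output_distillation(old_logits, new_logits, temperature=2.0):
    """Stage 2 (assumed form): KL(old || new) on softened word distributions."""
    p = softmax(old_logits, temperature)
    q = softmax(new_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def total_loss(task_loss, old_feat, new_feat, old_logits, new_logits,
               alpha=1.0, beta=1.0):
    """New-task loss plus the two weighted distillation terms."""
    return (task_loss
            + alpha * feature_distillation(old_feat, new_feat)
            + beta * output_distillation(old_logits, new_logits))


# Toy example: the new model drifts slightly from the frozen old model.
loss = total_loss(
    task_loss=1.5,
    old_feat=[0.2, 0.8, 0.5], new_feat=[0.1, 0.9, 0.5],
    old_logits=[2.0, 0.5, -1.0], new_logits=[1.5, 0.7, -0.8],
)
print(loss > 1.5)  # True: both distillation terms add a positive penalty
```

Both terms vanish when the new model reproduces the old model's features and outputs exactly, so the penalty grows only as the new model drifts away from what the old model encoded, without any old samples being replayed.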

MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

Mitigating Catastrophic Forgetting in Task-Incremental Continual Learning with Adaptive Classification Criterion

When Video Classification Meets Incremental Classes

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

Knowledge Restore and Transfer for Multi-label Class-Incremental Learning

Multimodal Memory Modelling for Video Captioning

Overcoming Catastrophic Forgetting for Multi-Label Class-Incremental Learning

Mixup-Inspired Video Class-Incremental Learning

Adaptive Curriculum Learning for Video Captioning

Adaptive online continual multi-view learning

Investigating the Catastrophic Forgetting in Multimodal Large Language Models

Incremental Model Enhancement Via Memory-based Contrastive Learning

Class-Incremental Learning with Multiscale Distillation for Weakly Supervised Temporal Action Localization

Learning without Forgetting for Vision-Language Models

Exemplar Masking for Multimodal Incremental Learning

Fine-Grained Knowledge Selection and Restoration for Non-Exemplar Class Incremental Learning

Continual Recognition with Adaptive Memory Update

Heterogeneous Forgetting Compensation for Class-Incremental Learning

Multi-view class incremental learning

Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer

More Classifiers, Less Forgetting: A Generic Multi-classifier Paradigm for Incremental Learning