MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

Huiyu Xiong, Lanxiao Wang, Heqian Qiu, Taijin Zhao, Benliu Qiu, Hongliang Li
2024-02-28
Abstract: To address the problem of catastrophic forgetting caused by the invisibility of old categories in sequential input, existing work based on relatively simple classification tasks has made some progress. In contrast, video captioning is a more complex task in a multimodal scenario and has not been explored in the field of incremental learning. After identifying this stability-plasticity problem when analyzing video with sequential input, we propose, for the first time, a method to Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning (MCF-VC). To maintain good performance on old tasks at the macro level, we design Fine-grained Sensitivity Selection (FgSS), based on the Mask of Linear's Parameters and Fisher Sensitivity, to pick useful knowledge from old tasks. Further, to better constrain the knowledge characteristics of old and new tasks at the specific feature level, we design Two-stage Knowledge Distillation (TsKD), which learns the new task well while still weighing the old task. Specifically, we design two distillation losses, which constrain the cross-modal semantic information of the semantic attention feature map and the textual information of the final outputs respectively, so that the inter-model and intra-model stylized knowledge of the old class is retained while learning the new class. To illustrate the ability of our model to resist forgetting, we design a metric, CIDER_t, to measure the stage forgetting rate. Our experiments on the public dataset MSR-VTT show that the proposed method significantly resists the forgetting of previous tasks without replaying old samples, and performs well on the new task.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses catastrophic forgetting caused by sequential input in the multimodal video captioning task. When a model is trained on new categories without access to old data, it may forget knowledge it learned earlier. The phenomenon is particularly prominent in video captioning, which fuses the visual and text modalities and is therefore more complex than simple classification. To alleviate it in class-incremental learning, the paper proposes MCF-VC (Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning). The main contributions of the paper are:
1. It explores, for the first time, incremental learning for video captioning, and proposes a method built on the video captioning backbone to deal with the forgetting of old tasks and the poor performance on new tasks that sequential input causes.
2. Given the particular demands of incremental video captioning, it modifies the backbone network to make it better suited to accepting sequentially arriving subtasks.
3. It designs a selector that adaptively picks valuable information from fine-grained old training parameters to improve the performance of the target model.
4. For the outputs of different stages of the backbone network, it proposes two distillation methods that constrain the representation of cross-modal and textual information, so that the model can balance old and new knowledge.
5. Experiments on the public dataset MSR-VTT show that the proposed incremental method significantly improves performance on both the metrics that evaluate forgetting in class-incremental video captioning and standard natural language processing metrics.
Through these contributions, the paper provides an effective solution that can significantly resist the forgetting of previous tasks without replaying old samples and perform well on new tasks.
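The two mechanisms named above, parameter selection by Fisher sensitivity and distillation at two stages of the network, can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not give the exact formulas, so the sketch assumes a diagonal empirical Fisher estimate (mean squared per-sample gradient), a top-k binary mask, a mean squared error on the attention feature map, and a temperature-scaled KL divergence on the output logits. All function names, the `keep_ratio`, `temperature`, `alpha`, and `beta` parameters, and the toy numbers are hypothetical.

```python
import math

def fisher_sensitivity(per_sample_grads):
    """Diagonal empirical Fisher: mean squared gradient per parameter (assumed form)."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def top_k_mask(scores, keep_ratio):
    """Binary mask over parameters; 1 marks the most sensitive fraction.
    Ties at the threshold are all kept, so the mask may select more than k entries."""
    k = max(1, round(keep_ratio * len(scores)))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1 if s >= threshold else 0 for s in scores]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def feature_distillation(old_feat, new_feat):
    """Stage 1 (assumed MSE): keep the new attention feature map close to the old one."""
    return sum((a - b) ** 2 for a, b in zip(old_feat, new_feat)) / len(old_feat)

def output_distillation(old_logits, new_logits, temperature=2.0):
    """Stage 2 (assumed KL): match the new word distribution to the old model's."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0)
    return temperature ** 2 * kl

def total_loss(caption_loss, feat_loss, out_loss, alpha=1.0, beta=1.0):
    """New-task captioning loss plus the two weighted distillation terms."""
    return caption_loss + alpha * feat_loss + beta * out_loss

# Toy usage: two samples, four parameters.
grads = [[1.0, 0.0, 2.0, 0.1], [1.0, 0.0, 0.0, 0.1]]
mask = top_k_mask(fisher_sensitivity(grads), keep_ratio=0.5)
```

In this toy setup the mask marks the parameters whose old-task gradients were largest on average; how the selected parameters are then reused or protected while training on the new task is a design choice the summary does not specify. Both distillation terms are zero when the new model reproduces the old model's features and outputs exactly, and grow as it drifts away from them.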