An Attention-based Representation Distillation Baseline for Multi-Label Continual Learning

Martin Menabue, Emanuele Frascaroli, Matteo Boschini, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
2024-07-19
Abstract: The field of Continual Learning (CL) has inspired numerous researchers over the years, leading to increasingly advanced countermeasures to the issue of catastrophic forgetting. Most studies have focused on the single-class scenario, where each example comes with a single label. The recent literature has successfully tackled such a setting, with impressive results. Differently, we shift our attention to the multi-label scenario, as we feel it to be more representative of real-world open problems. In our work, we show that existing state-of-the-art CL methods fail to achieve satisfactory performance, thus questioning the real advance claimed in recent years. Therefore, we assess both old-style and novel strategies and propose, on top of them, an approach called Selective Class Attention Distillation (SCAD). It relies on a knowledge transfer technique that seeks to align the representations of the student network -- which trains continuously and is subject to forgetting -- with those of the teacher network, which is pretrained and kept frozen. Importantly, our method is able to selectively transfer the relevant information from the teacher to the student, thereby preventing irrelevant information from harming the student's performance during online training. To demonstrate the merits of our approach, we conduct experiments on two different multi-label datasets, showing that our method outperforms the current state-of-the-art Continual Learning methods. Our findings highlight the importance of addressing the unique challenges posed by multi-label environments in the field of Continual Learning. The code of SCAD is available at https://github.com/aimagelab/SCAD-LOD-2024.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses the poor performance of existing state-of-the-art methods in multi-label continual learning (MLCL), in particular their vulnerability to catastrophic forgetting. Specifically:

1. **Limitations of existing methods**: Many continual learning methods achieve remarkable results on single-label tasks, but their performance in multi-label scenarios is far from satisfactory. This indicates serious deficiencies in how existing methods handle multi-label data.
2. **Challenges in multi-label scenarios**: Multi-label continual learning is more challenging than its single-label counterpart because each sample may belong to multiple categories simultaneously, which makes it harder for the model to retain old knowledge while continuously learning new tasks.
3. **Proposed new method**: To address these challenges, the authors propose Selective Class Attention Distillation (SCAD). The method alleviates catastrophic forgetting by selectively transferring relevant information from a pretrained, frozen teacher model to the student model.

### Main contributions

- **Benchmark testing**: The authors conducted experiments on two multi-label continual learning benchmarks (IIRC CIFAR-100 and Incremental WebVision) to verify the effectiveness of SCAD.
- **Performance improvement**: The experimental results show that SCAD outperforms existing continual learning methods on both benchmarks.
- **Analysis and evaluation**: The authors evaluated the model with multiple metrics (such as the final average PWJS and the adjusted forgetting metric \(FG_f\)), further demonstrating the advantages of SCAD.

### Formula summary

- **Final average PWJS**:
  \[ AR_f = \frac{1}{N}\sum_{m=1}^{N} R_{N,m} \]
  where \(R_{j,k}\) is the performance of the model on the \(k\)-th task after training on the \(j\)-th task, and \(N\) is the total number of tasks.
- **Adjusted forgetting metric \(FG_f\)**:
  \[ FG_f = \frac{1}{N-1}\sum_{m=1}^{N-1}\left[\frac{R^*_{m} - R_{N,m}}{R^*_{m}}\right]^+ \]
  where \(R^*_{m} = \max_{t\in\{m,\ldots,N-1\}} R_{t,m}\) and \([x]^+ = \max(x, 0)\).

### Conclusion

With SCAD, the authors mitigate catastrophic forgetting in multi-label continual learning and verify the method's effectiveness on two benchmark datasets. The work offers a new baseline for multi-label continual learning.
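
As a worked illustration of the two metrics in the formula summary, the sketch below computes \(AR_f\) and \(FG_f\) from a results matrix. It is a minimal sketch, not the authors' evaluation code: the function names, the list-of-lists layout for `R` (with `R[j][k]` holding the score on task `k+1` after training on task `j+1`), and the choice to skip tasks whose best score is zero are all assumptions made here for illustration.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def final_average(R: Matrix) -> float:
    """AR_f: mean score over all N tasks after training on the last task."""
    last_row = R[-1]
    return sum(last_row) / len(last_row)


def adjusted_forgetting(R: Matrix) -> float:
    """FG_f: mean relative drop from the best earlier score, clipped at zero."""
    n = len(R)
    if n < 2:
        raise ValueError("FG_f needs at least two tasks")
    total = 0.0
    for m in range(n - 1):  # tasks 1..N-1 (0-indexed here)
        # R*_m: best score on task m over training steps m..N-1 (1-indexed)
        best = max(R[t][m] for t in range(m, n - 1))
        if best > 0:  # assumption: a zero best score contributes no forgetting
            total += max((best - R[-1][m]) / best, 0.0)
    return total / (n - 1)


# Hypothetical scores for N = 3 tasks; entries above the diagonal are unused.
R_example = [
    [0.8, 0.0, 0.0],
    [0.6, 0.7, 0.0],
    [0.4, 0.7, 0.9],
]
print(final_average(R_example))        # (0.4 + 0.7 + 0.9) / 3
print(adjusted_forgetting(R_example))  # (0.5 + 0.0) / 2
```

In this made-up example, task 1 peaks at 0.8 and ends at 0.4, a relative drop of 0.5, while task 2 does not degrade, so \(FG_f\) averages to 0.25. Dividing by \(R^*_m\) expresses forgetting relative to the best score a task ever reached, so tasks with low absolute scores are not under-weighted.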