Class-Incremental Learning for Sound Event Localization and Detection

Ruchi Pandey, Manjunath Mulimani, Archontis Politis, Annamaria Mesaros
2024-11-20
Abstract: This paper investigates the feasibility of class-incremental learning (CIL) for Sound Event Localization and Detection (SELD) tasks. The method features an incremental learner that can learn new sound classes independently while preserving knowledge of old classes. Continual learning is achieved through a mean square error-based distillation loss that minimizes output discrepancies between subsequent learners. The experiments are conducted on the TAU-NIGENS Spatial Sound Events 2021 dataset, which includes 12 different sound classes, and demonstrate the efficacy of the proposed method. We begin by learning 8 classes and introduce the 4 new classes at the next stage. After the incremental phase, the system is evaluated on the full set of learned classes. Results show that, for this realistic dataset, our proposed method successfully maintains baseline performance across all metrics.
Audio and Speech Processing
What problem does this paper attempt to address?
The problem this paper addresses is how to implement class-incremental learning (CIL) for Sound Event Localization and Detection (SELD), so that a model can learn new sound classes independently, without retraining on all previous data, while retaining its ability to recognize old classes.

### Specific description of the problem

1. **Limitations of existing methods**:
   - Current SELD models are usually trained on a fixed set of sound classes. Once a model is trained, adding new sound classes requires retraining or fine-tuning the entire model. Fine-tuning, however, may lead to catastrophic forgetting: the model loses its knowledge of old classes while learning new ones.
2. **Requirements in practical application scenarios**:
   - In practical applications such as surveillance, robots, and smart home devices, the system needs to be flexible enough to add new sound classes dynamically without retraining the entire model. This improves the adaptability of the system and also reduces the computational cost.

### Solutions proposed in the paper

The paper proposes a method based on class-incremental learning (CIL-SELD), which addresses the above problems in the following ways:

- **Phased learning**: First, train a base model to recognize 8 initial sound classes. Then, introduce 4 new sound classes in the incremental phase without retraining the entire model.
- **Output distillation loss**: To prevent catastrophic forgetting, use the mean squared error (MSE) as the distillation loss function. This loss encourages the model's predicted output for the old classes to stay consistent with the output of the previous stage when new classes are introduced. The specific formula is as follows:

\[ L = (1 - \lambda) L_{\text{MSE}} + \lambda L_{\text{OD}} \]

where:

- \( L_{\text{MSE}} \) is the mean squared error loss for the 4 new classes.
- \( L_{\text{OD}} \) is the distillation loss, which minimizes the output difference between the old model (Stage 0) and the updated model (Stage 1) on the original 8 classes.
- \( \lambda \) is a balancing parameter that controls the trade-off between learning new knowledge and retaining old knowledge.

In this way, the CIL-SELD method can maintain the ability to recognize old classes while new classes are introduced, yielding a more flexible and efficient SELD system.
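
The combined loss can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`mse`, `cil_seld_loss`), the flat one-value-per-class output layout, and the toy numbers are all assumptions made for illustration. The actual SELD output format and the exact form of \( L_{\text{OD}} \) are not specified in this summary, so MSE against the Stage 0 outputs is assumed here.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cil_seld_loss(student_out, teacher_out, target_new, n_old, lam):
    """Combined loss L = (1 - lam) * L_MSE + lam * L_OD.

    student_out: Stage 1 outputs for all classes (old classes first, then new).
    teacher_out: frozen Stage 0 outputs for the n_old old classes.
    target_new:  ground-truth targets for the new classes only.
    The layout (one value per class, old classes first) is an assumption.
    """
    loss_new = mse(student_out[n_old:], target_new)  # L_MSE on the new classes
    loss_od = mse(student_out[:n_old], teacher_out)  # L_OD on the old classes
    return (1 - lam) * loss_new + lam * loss_od


# Toy example for a single frame: 8 old classes and 4 new classes.
teacher = [0.0] * 8                          # Stage 0 outputs (old classes)
student = [0.2] * 8 + [1.0, 0.0, 0.0, 0.0]   # Stage 1 outputs (all 12 classes)
target = [1.0, 0.0, 0.0, 1.0]                # targets for the 4 new classes
loss = cil_seld_loss(student, teacher, target, n_old=8, lam=0.5)
```

With \( \lambda = 0 \) the sketch reduces to plain training on the new classes, and with \( \lambda = 1 \) it only matches the Stage 0 outputs on the old classes; intermediate values trade one objective against the other.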