Continual Audio-Visual Sound Separation

Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian
2024-11-05
Abstract: In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable for real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose a novel approach named ContAV-Sep (Continual Audio-Visual Sound Separation). ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting. The CrossSDC can seamlessly integrate into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines for audio-visual sound separation. Code is available at: https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024.
Computer Vision and Pattern Recognition, Machine Learning, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses audio-visual sound separation in a continual learning setting. Specifically, the researchers aim to develop a model that can keep learning to separate sound sources from new categories while maintaining its performance on previously learned categories. This matters in practice because real-world sound environments change over time and new sound sources keep emerging, yet existing audio-visual separation models often suffer from catastrophic forgetting when trained on new categories: they lose the knowledge of old tasks while learning new ones. The goal of the paper is therefore to alleviate this problem, enabling the model to adapt continuously to new sound categories without forgetting old knowledge. To achieve this, the paper proposes a method named ContAV-Sep. It reduces the risk of catastrophic forgetting by introducing the Cross-modal Similarity Distillation Constraint (CrossSDC), which maintains cross-modal semantic similarity across incremental tasks and retains the semantic similarity knowledge acquired by old models. CrossSDC can be seamlessly integrated into the training processes of different audio-visual sound separation frameworks. Experimental results show that ContAV-Sep effectively alleviates catastrophic forgetting and significantly outperforms other continual learning baselines for audio-visual sound separation.
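The general idea of distilling cross-modal similarity from an old model into a new one can be illustrated with a minimal sketch. This is not the paper's CrossSDC formulation, which is not given in the text above. It is a generic similarity-distillation penalty under stated assumptions: each model produces one audio and one visual embedding per sample, similarity is cosine similarity, and the penalty is the mean squared difference between the old and new audio-to-visual similarity matrices. All function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against zero vectors


def cross_modal_similarity(audio_feats, visual_feats):
    """Matrix of cosine similarities: rows are audio, columns are visual."""
    return [[cosine(a, v) for v in visual_feats] for a in audio_feats]


def similarity_distillation_loss(old_audio, old_visual, new_audio, new_visual):
    """Mean squared difference between old and new cross-modal similarities.

    Assumption: the old model is frozen and only provides target similarities;
    the new model is the one being trained on the current task.
    """
    s_old = cross_modal_similarity(old_audio, old_visual)
    s_new = cross_modal_similarity(new_audio, new_visual)
    count = len(s_old) * len(s_old[0])
    total = sum(
        (so - sn) ** 2
        for row_old, row_new in zip(s_old, s_new)
        for so, sn in zip(row_old, row_new)
    )
    return total / count


# Toy embeddings for two samples (hypothetical values).
old_audio = [[1.0, 0.0], [0.0, 1.0]]
old_visual = [[1.0, 0.0], [0.0, 1.0]]
drifted_audio = [[0.0, 1.0], [1.0, 0.0]]  # new model swapped the associations

loss_unchanged = similarity_distillation_loss(
    old_audio, old_visual, old_audio, old_visual
)
loss_drifted = similarity_distillation_loss(
    old_audio, old_visual, drifted_audio, old_visual
)
```

In this toy setup the penalty is zero when the new model reproduces the old model's audio-visual associations and grows when those associations drift. In a training loop, a term of this kind would typically be added to the separation loss with a weighting factor, which is also an assumption rather than something stated above.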