Continual Audio-Visual Sound Separation

Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian

2024-11-05

Abstract: In this paper, we introduce a novel continual audio-visual sound separation task, aiming to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception as it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable for real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging as our models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose a novel approach named ContAV-Sep (Continual Audio-Visual Sound Separation). ContAV-Sep presents a novel Cross-modal Similarity Distillation Constraint (CrossSDC) to uphold the cross-modal semantic similarity through incremental tasks and retain previously acquired knowledge of semantic similarity in old models, mitigating the risk of catastrophic forgetting. The CrossSDC can seamlessly integrate into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep can effectively mitigate catastrophic forgetting and achieve significantly better performance compared to other continual learning baselines for audio-visual sound separation. Code is available at: https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024

Computer Vision and Pattern Recognition, Machine Learning, Multimedia, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses audio-visual sound separation in a continual learning setting. The goal is a model that can keep learning to separate sound sources from new categories while maintaining its performance on previously learned categories. This matters in practice because real-world sound environments change over time: new sound sources keep emerging, and existing audio-visual separation models tend to suffer catastrophic forgetting, that is, they lose knowledge of old tasks while learning new ones. To alleviate this, the paper proposes ContAV-Sep. The method introduces a Cross-modal Similarity Distillation Constraint (CrossSDC), which maintains cross-modal semantic similarity across incremental tasks and retains the semantic similarity knowledge captured by old models. CrossSDC can be integrated into the training process of different audio-visual sound separation frameworks. Experimental results show that ContAV-Sep effectively alleviates catastrophic forgetting and significantly outperforms other continual learning baselines for audio-visual sound separation.
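To make the idea of a similarity distillation constraint more concrete, here is a minimal sketch in Python with NumPy. It is not the paper's CrossSDC formulation, which this summary does not spell out. It only illustrates the general pattern of penalizing drift between the cross-modal similarity computed by a frozen old model and by the model being trained. The function names, the use of cosine similarity, the mean-squared penalty, and the random features are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_modal_similarity(audio, visual):
    """Cosine similarity between every audio row and every visual row."""
    return l2_normalize(audio) @ l2_normalize(visual).T

def similarity_distillation_loss(audio_new, visual_new, audio_old, visual_old):
    """Hypothetical penalty: mean squared gap between the cross-modal
    similarity matrix of the current model and that of the old model."""
    sim_new = cross_modal_similarity(audio_new, visual_new)
    sim_old = cross_modal_similarity(audio_old, visual_old)
    return float(np.mean((sim_new - sim_old) ** 2))

# Stand-in features: 4 samples, 16 dimensions per modality (assumed sizes).
rng = np.random.default_rng(0)
audio_old = rng.normal(size=(4, 16))
visual_old = rng.normal(size=(4, 16))

# If the new model reproduces the old features, there is nothing to penalize.
loss_same = similarity_distillation_loss(audio_old, visual_old,
                                         audio_old, visual_old)

# If the audio features drift while learning a new task, the penalty grows.
audio_drift = audio_old + rng.normal(scale=0.5, size=audio_old.shape)
loss_drift = similarity_distillation_loss(audio_drift, visual_old,
                                          audio_old, visual_old)

print(loss_same, loss_drift)
```

In a training loop, a term of this kind would typically be added to the separation loss with a weighting coefficient, so that learning new classes is balanced against preserving the old cross-modal associations. How ContAV-Sep weights and structures its constraint is described in the paper itself, not here.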

Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding

Each Perform Its Functions: Task Decomposition and Feature Assignment for Audio-Visual Segmentation

Cross-modal Cognitive Consensus guided Audio-Visual Segmentation

Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues

Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics

BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

Audio-Visual Segmentation

Audio-Visual Instance Segmentation

Improving Audio-Visual Segmentation with Bidirectional Generation

Audio-Visual Segmentation with Semantics

Audio-Visual Class-Incremental Learning

Co-Separating Sounds of Visual Objects

High-Quality Visually-Guided Sound Separation from Diverse Categories

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

Leveraging Foundation models for Unsupervised Audio-Visual Segmentation