Abstract: In this paper, we introduce a novel continual audio-visual sound separation task, which aims to continuously separate sound sources for new classes while preserving performance on previously learned classes, with the aid of visual guidance. This problem is crucial for practical visually guided auditory perception because it can significantly enhance the adaptability and robustness of audio-visual sound separation models, making them more applicable to real-world scenarios where encountering new sound sources is commonplace. The task is inherently challenging because models must not only effectively utilize information from both modalities in current tasks but also preserve their cross-modal association in old tasks to mitigate catastrophic forgetting during audio-visual continual learning. To address these challenges, we propose an approach named ContAV-Sep (\textbf{Cont}inual \textbf{A}udio-\textbf{V}isual Sound \textbf{Sep}aration). ContAV-Sep introduces a Cross-modal Similarity Distillation Constraint (CrossSDC) that maintains cross-modal semantic similarity across incremental tasks and retains the knowledge of semantic similarity acquired by old models, mitigating the risk of catastrophic forgetting. CrossSDC can be seamlessly integrated into the training process of different audio-visual sound separation frameworks. Experiments demonstrate that ContAV-Sep effectively mitigates catastrophic forgetting and achieves significantly better performance than other continual learning baselines for audio-visual sound separation. Code is available at: \url{https://github.com/weiguoPian/ContAV-Sep_NeurIPS2024}.

Continuous speech separation: Dataset and analysis

Continuous Speech Separation with Conformer

Overlap Aware Continuous Speech Separation Without Permutation Invariant Training

Dual-Path Modeling for Long Recording Speech Separation in Meetings

A Deep Analysis of Speech Separation Guided Diarization Under Realistic Conditions

Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation

A comprehensive study of speech separation: spectrogram vs waveform separation

Dual-Path RNN for Long Recording Speech Separation

ConSep: a Noise- and Reverberation-Robust Speech Separation Framework by Magnitude Conditioning

Dual-Path Modeling with Memory Embedding Model for Continuous Speech Separation

Continual Audio-Visual Sound Separation

Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training

Multi-channel Conversational Speaker Separation via Neural Diarization

Investigation of Practical Aspects of Single Channel Speech Separation for ASR

Cepstral Smoothing of Spectral Masks for Acoustic Vector-Sensor Based Convolutive Speech Separation

Listening and Grouping: an Online Autoregressive Approach for Monaural Speech Separation

State and Frontiers of Research in Speech Separation

Supervised Speech Separation Based on Deep Learning: An Overview

A Robust Unsupervised Method for the Single Channel Speech Separation

The 2010 Signal Separation Evaluation Campaign (SiSEC2010): Audio Source Separation

Speaker and Direction Inferred Dual-channel Speech Separation