Abstract: Target speech separation refers to extracting the target speaker's speech from mixed signals. Despite recent advances in deep learning based close-talk speech separation, applying these methods in real-world conditions remains an open issue. The two main challenges are the complex acoustic environment and the real-time processing requirement. To address these challenges, we propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture in reverberant environments, assisted by directional information about the speaker(s). First, to cope with the variations introduced by complex environments, the key idea is to make the acoustic representation more complete by jointly modeling the temporal, spectral and spatial discriminability between the target and the interfering sources. Specifically, temporal, spectral and spatial features, along with the designed directional features, are integrated to create a joint acoustic representation. Second, to reduce latency, we design a fully-convolutional autoencoder framework, which is purely end-to-end and single-pass. All feature computation is implemented by network layers and operations to speed up the separation procedure. Evaluation is conducted on simulated reverberant versions of WSJ0-2mix and WSJ0-3mix under a speaker-independent scenario. Experimental results demonstrate that the proposed method outperforms state-of-the-art deep learning based multi-channel approaches with fewer parameters and faster processing speed. Furthermore, the proposed temporal-spatial neural filter can handle mixtures with a varying and unknown number of speakers and maintains its performance even in the presence of direction estimation errors. Code and models will be released soon.
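
The abstract does not spell out how the directional features are computed. As a rough illustration only, the sketch below assumes one common form: the cosine similarity between the observed inter-channel phase difference (IPD) of a microphone pair and the phase difference expected for a far-field source at the target direction. The function names, the two-microphone geometry, and the parameters (`mic_distance`, `theta`, `c`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def interchannel_phase_difference(spec_ref, spec_other):
    """Observed IPD (radians, wrapped) between two complex spectrograms of shape (F, T)."""
    return np.angle(spec_ref * np.conj(spec_other))

def directional_feature(ipd, freqs, mic_distance, theta, c=343.0):
    """Assumed directional feature: cos(observed IPD - expected IPD).

    ipd          : (F, T) observed inter-channel phase difference in radians
    freqs        : (F,) centre frequency of each bin in Hz
    mic_distance : spacing of the microphone pair in metres (hypothetical setup)
    theta        : assumed target direction in radians, measured from the pair axis
    c            : speed of sound in m/s
    """
    # Far-field time difference of arrival for a source at angle theta.
    tdoa = mic_distance * np.cos(theta) / c
    # Phase difference a plane wave from theta would produce in each bin.
    expected = 2.0 * np.pi * freqs * tdoa
    # Values near 1 mark time-frequency bins consistent with the target direction.
    return np.cos(ipd - expected[:, None])
```

In this sketch the feature is close to 1 in bins dominated by a source at `theta` and fluctuates elsewhere, which is the kind of direction-dependent cue that could be concatenated with spectral and spatial features in a joint representation.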

Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues

SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model

Target Sound Extraction with Variable Cross-modality Clues

Improving Target Sound Extraction with Timestamp Information

A Study of Multichannel Spatiotemporal Features and Knowledge Distillation on Robust Target Speaker Extraction

Dual-Channel Target Speaker Extraction Based on Conditional Variational Autoencoder and Directional Information

Time Difference of Arrival Estimation Exploiting Multichannel Spatio-Temporal Prediction

CATSE: A Context-Aware Framework for Causal Target Sound Extraction

SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

DeFT-Mamba: Universal Multichannel Sound Separation and Polyphonic Audio Classification

DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Exploring Time-Frequency Domain Target Speaker Extraction For Causal and Non-Causal Processing

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Acoustic-Scene-Aware Target Sound Separation With Sound Embedding Refinement

Speaker Extraction with Detection of Presence and Absence of Target Speakers

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Beamformer-Guided Target Speaker Extraction

Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation