Abstract: In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. A time-frequency domain speech separation method is first applied as the pre-separation stage, whose aim is to separate the mixture preliminarily. Although this stage can separate the mixture, its output still contains residual interference. To enhance the pre-separated speech and further improve separation performance, we propose the end-to-end post-filter (E2EPF) with deep attention fusion features. The E2EPF makes full use of the prior knowledge in the pre-separated speech, which contributes to speech separation. It is a fully convolutional speech separation network that takes the waveform as its input. First, a 1-D convolutional layer extracts deep representation features of the mixture and the pre-separated signals in the time domain. Second, to pay more attention to the outputs of the pre-separation stage, an attention module is applied to acquire deep attention fusion features, which are extracted by computing the similarity between the mixture and the pre-separated speech. These deep attention fusion features are conducive to reducing the interference and enhancing the pre-separated speech. Finally, these features are sent to the post-filter to estimate each target signal. Experimental results on the WSJ0-2mix dataset show that the proposed method outperforms the state-of-the-art speech separation method. Compared with the pre-separation method, the proposed method achieves 64.1%, 60.2%, 25.6% and 7.5% relative improvements in the scale-invariant source-to-noise ratio (SI-SNR), signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) measures, respectively.
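The feature-extraction and attention steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the form of the similarity, so scaled dot-product attention with a softmax over frames is assumed here, as are the fusion by concatenation, the random (untrained) filters, and all names and sizes (`conv1d_encoder`, `attention_fusion`, 64 filters, window 16, stride 8). The post-filter that would consume the fused features is omitted.

```python
import numpy as np

def conv1d_encoder(wave, filters, stride):
    """Strided 1-D convolution: frame the waveform, project onto filters, apply ReLU."""
    n_filters, win = filters.shape
    n_frames = 1 + (len(wave) - win) // stride
    frames = np.stack([wave[i * stride:i * stride + win] for i in range(n_frames)])
    return np.maximum(frames @ filters.T, 0.0)          # (n_frames, n_filters)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(mix_feat, pre_feat):
    """Assumed fusion: pre-separated frames attend over mixture frames."""
    d = mix_feat.shape[-1]
    scores = pre_feat @ mix_feat.T / np.sqrt(d)          # (T, T) frame similarity
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    attended = weights @ mix_feat                        # (T, d) similarity-weighted mixture
    return np.concatenate([pre_feat, attended], axis=-1) # (T, 2d) fused features

# Hypothetical inputs: random signals stand in for the mixture and one pre-separated output.
rng = np.random.default_rng(0)
mixture = rng.standard_normal(1600)
pre_separated = rng.standard_normal(1600)
filters = 0.1 * rng.standard_normal((64, 16))            # untrained stand-in for learned filters

mix_feat = conv1d_encoder(mixture, filters, stride=8)
pre_feat = conv1d_encoder(pre_separated, filters, stride=8)
fused = attention_fusion(mix_feat, pre_feat)
print(fused.shape)  # (199, 128)
```

In this sketch the pre-separated features act as queries and the mixture features as keys and values, so the fused representation emphasises the mixture frames most similar to the pre-separated estimate. Other choices, such as multiplying the features by an attention mask, would be equally consistent with the abstract's wording.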

Speakerfilter: deep learning-based target speaker extraction using anchor speech

Hierarchical speaker representation for target speaker extraction

Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech Separation

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction

Selective Listening by Synchronizing Speech with Lips

3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications

Improving Target Speaker Extraction with Sparse LDA-transformed Speaker Embeddings

Binaural Selective Attention Model for Target Speaker Extraction

Cracking the cocktail party problem by multi-beam deep attractor network

Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

X-TaSNet: Robust and Accurate Time-Domain Speaker Extraction Network

Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Single-Channel Multi-Speaker Separation using Deep Clustering

Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method

Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation

Speech separation of a target speaker based on deep neural networks

Target conversation extraction: Source separation using turn-taking dynamics

A Speaker-Dependent Approach to Separation of Far-Field Multi-Talker Microphone Array Speech for Front-End Processing in the CHiME-5 Challenge

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition