SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention.

Ying Hu, Haitao Xu, Zhongcun Guo, Hao Huang, Liang He
DOI: https://doi.org/10.1109/ICASSP48485.2024.10447832
2024-01-01
Abstract: We propose a deep neural network with spectrogram matching and mutual attention (SMMA-Net) for audio clue-based target speaker extraction (TSE). To make effective use of the auxiliary speech, we propose a spectrogram matching (SM) strategy and a mutual attention (MA) block. We conducted all experiments on the WSJ0-2mix-extr dataset. The ablation and comparison studies verified the effectiveness of the SM strategy and the MA block. The experimental results show that our proposed method outperforms state-of-the-art methods by 1.3 dB in scale-invariant signal-to-distortion ratio improvement (SI-SDRi). Additionally, with SMMA-Net, the model's performance on the TSE task exceeds that of a model with a similar architecture on the speaker separation task. The main code will be available at https://github.com/Ht-Xu/SMMA-Net.
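The abstract reports its gain in scale-invariant signal-to-distortion ratio improvement but does not spell out the computation. The following is a minimal sketch of the commonly used definition, not the authors' evaluation code: the function names `si_sdr` and `si_sdr_improvement`, the mean removal step, and the small `eps` constant are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Assumed convention: both signals are made zero-mean first, and a
    small eps (hypothetical value) guards against division by zero.
    """
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]

    # Project the estimate onto the reference: this is the target part.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Whatever is left after removing the target part counts as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the extracted signal minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals, invented purely for illustration.
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 0.5, -1.5]   # reference plus an interfering signal
estimate = [1.1, -0.9, 0.8, -1.0]  # a hypothetical extraction output
improvement_db = si_sdr_improvement(estimate, mixture, reference)
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the estimate leaves the score unchanged, which is what makes the metric scale-invariant. A reported margin of 1.3 dB is a difference between two such improvement values averaged over a test set.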