Target Speaker Extraction by Directly Exploiting Contextual Information in the Time-Frequency Domain

Xue Yang, Changchun Bao, Jing Zhou, Xianhong Chen
2024-02-27
Abstract: In target speaker extraction, many studies rely on a speaker embedding that is obtained from an enrollment of the target speaker and used as guidance. However, using the speaker embedding alone may not fully utilize the contextual information contained in the enrollment. In this paper, we directly exploit this contextual information in the time-frequency (T-F) domain. Specifically, the T-F representations of the enrollment and the mixed signal interact through an attention mechanism to compute weighting matrices. These weighting matrices reflect the similarity among different frames of the T-F representations and are further employed to obtain consistent T-F representations of the enrollment. These consistent representations serve as the guidance, allowing for better exploitation of the contextual information. Furthermore, the proposed method achieves state-of-the-art performance on the benchmark dataset and shows its effectiveness in complex scenarios.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses target speaker extraction: recovering the target speaker's voice from a mixed speech signal. Many existing methods rely on a speaker embedding derived from an enrollment utterance of the target speaker, but the embedding alone may not fully utilize the contextual information in that enrollment. The paper proposes a method that exploits this contextual information directly in the time-frequency (T-F) domain. Specifically, an attention mechanism computes weighting matrices from the T-F representations of the mixed signal and the enrollment. These matrices reflect the similarity between frames of the two representations and are used to obtain consistent T-F representations of the enrollment, which serve as better guidance for extracting the target speaker. Experimental results show that the method achieves state-of-the-art performance on the benchmark dataset and remains effective in complex scenarios.
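The frame-level attention described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes real-valued T-F features (for example, magnitude spectrograms), plain scaled dot-product attention with no learned projections, and it reads "consistent" as "aligned to the frames of the mixed signal". The function name consistent_enrollment, the feature sizes, and the final concatenation step are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def consistent_enrollment(mix_tf, enr_tf):
    """Align enrollment frames to mixture frames with attention.

    mix_tf: (T_mix, F) T-F features of the mixed signal (assumed real-valued).
    enr_tf: (T_enr, F) T-F features of the enrollment.
    Returns the weighting matrix (T_mix, T_enr) and the enrollment
    representation re-expressed on the mixture's time axis (T_mix, F).
    """
    d = mix_tf.shape[-1]
    # Similarity between every mixture frame and every enrollment frame.
    scores = mix_tf @ enr_tf.T / np.sqrt(d)
    # Each row is a distribution over enrollment frames.
    weights = softmax(scores, axis=-1)
    # Weighted sum of enrollment frames for each mixture frame.
    aligned = weights @ enr_tf
    return weights, aligned

# Toy data: 50 mixture frames, 80 enrollment frames, 64 frequency bins.
rng = np.random.default_rng(0)
mix = rng.standard_normal((50, 64))
enr = rng.standard_normal((80, 64))

weights, aligned = consistent_enrollment(mix, enr)

# One possible way to use the result as guidance (an assumption here):
# concatenate it with the mixture features along the feature axis.
guided = np.concatenate([mix, aligned], axis=-1)
```

Because the aligned enrollment has the same number of frames as the mixed signal, it can be combined with the mixture frame by frame, which a single utterance-level speaker embedding does not allow without broadcasting.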