Dual-Channel Target Speaker Extraction Based on Conditional Variational Autoencoder and Directional Information
Rui Wang, Li Li, Tomoki Toda
DOI: https://doi.org/10.1109/taslp.2024.3376154
2024-01-01
Abstract: Target speaker extraction (TSE) has become an attractive research topic in recent years. However, TSE under underdetermined conditions remains a challenge. In this paper, we address a dual-channel TSE problem under underdetermined conditions. Geometric source separation (GSS) has been used as a solution to the TSE problem, but the performance of conventional GSS methods is limited under underdetermined conditions because they lack a powerful source model. We propose a dual-channel TSE method that combines target selection based on geometric constraints, more powerful source modeling, and nonlinear postprocessing. A geometric constraint (GC) on the target direction of arrival (DOA) is applied to select the target, and two conditional variational autoencoders (CVAEs) are used to model a single speaker's speech and the interference mixture speech, respectively. For postprocessing, an ideal ratio time-frequency (TF) mask estimated from the separated interference mixture speech is used to extract the target speaker's speech. Moreover, to mitigate the impact of DOA estimation errors, we improve the objective function so that the target DOA information can be modified. The experimental results demonstrate that the proposed method achieves improvements of 6.24 dB and 8.37 dB over the baseline method in terms of signal-to-distortion ratio (SDR) and source-to-interference ratio (SIR), respectively, under medium reverberation (470 ms). Furthermore, analysis of the experimental results shows that the improved method is robust against DOA estimation errors.
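The mask-based postprocessing step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that complex STFTs of the separated target and the separated interference mixture are already available, and it uses a generic power-ratio mask. The function names `ratio_mask` and `apply_mask`, the array shapes, and the random test data are all hypothetical.

```python
import numpy as np

def ratio_mask(target_stft, interference_stft, eps=1e-8):
    """Power-ratio TF mask in [0, 1] from two separated complex spectrograms.

    Assumption: both inputs have shape (freq_bins, frames). This is a generic
    ratio mask, not necessarily the exact mask used in the paper.
    """
    p_target = np.abs(target_stft) ** 2
    p_interf = np.abs(interference_stft) ** 2
    return p_target / (p_target + p_interf + eps)

def apply_mask(mixture_stft, mask):
    """Scale each TF bin of the mixture by the mask (mixture phase is kept)."""
    return mask * mixture_stft

# Synthetic stand-ins for separated spectrograms (hypothetical data).
rng = np.random.default_rng(0)
shape = (257, 100)  # (freq_bins, frames), chosen arbitrarily
target = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
interference = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = target + interference

mask = ratio_mask(target, interference)
estimate = apply_mask(mixture, mask)
```

Bins dominated by the interference estimate get a mask value near 0 and are attenuated, while bins dominated by the target stay close to the mixture value. This is what makes the postprocessing nonlinear, in contrast to a purely linear spatial filter.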
engineering, electrical & electronic; acoustics