3D Spatial Features for Multi-Channel Target Speech Separation

Rongzhi Gu, Shi-Xiong Zhang, Meng Yu, Dong Yu
DOI: https://doi.org/10.1109/asru51503.2021.9688198
2021-01-01
Abstract: The use of a speaker's directional information for speech separation and speech recognition has demonstrated state-of-the-art performance in multi-talker scenarios. One major limitation of previous approaches that use the speaker's directional information is significant performance degradation when the arrival directions of two sound sources are close. To address this limitation, this paper proposes a set of new three-dimensional (3D) spatial features for target speech separation that leverage all of the 3D location information of the target speaker: azimuth, elevation, and distance to the microphone array center. Previous work in this area is extended in two important directions. First, the traditional 1D directional features are generalized to 3D spatial features, which yields more discriminative spatial diversity between speakers. Second, to unleash the full power of these 3D spatial features, a microphone pair-wise attention model is also proposed. The proposed features and models were evaluated on both simulated reverberant datasets and real recordings under near-field and far-field conditions. Experimental results show that both the proposed 3D spatial features and the attention models can significantly improve separation performance and reduce the recognition error rate.
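
The motivation for moving from 1D to 3D location information can be illustrated with a small geometric sketch. It is not the paper's feature definition: the microphone positions, the speed of sound, the 1 kHz test frequency, and all function names below are assumptions chosen for illustration. The sketch compares the inter-microphone delay predicted from azimuth alone (plane-wave model) with the delay predicted from the full 3D position (azimuth, elevation, distance) for two hypothetical speakers that share the same azimuth.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def source_position(azimuth_deg, elevation_deg, distance_m):
    """Convert (azimuth, elevation, distance) to Cartesian coordinates,
    with the microphone array center at the origin."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        distance_m * math.cos(el) * math.cos(az),
        distance_m * math.cos(el) * math.sin(az),
        distance_m * math.sin(el),
    )


def tdoa_azimuth_only(azimuth_deg, mic_a, mic_b):
    """Delay between two microphones under a far-field, horizontal-plane
    model that uses azimuth only (a 1D directional cue)."""
    az = math.radians(azimuth_deg)
    direction = (math.cos(az), math.sin(az), 0.0)
    # Plane-wave approximation: path difference is the projection of the
    # microphone baseline (mic_b - mic_a) onto the source direction.
    path_diff = sum(u * (b - a) for u, a, b in zip(direction, mic_a, mic_b))
    return path_diff / SPEED_OF_SOUND


def tdoa_3d(azimuth_deg, elevation_deg, distance_m, mic_a, mic_b):
    """Delay between two microphones computed from the full 3D source
    position using exact source-to-microphone distances."""
    src = source_position(azimuth_deg, elevation_deg, distance_m)
    return (math.dist(src, mic_a) - math.dist(src, mic_b)) / SPEED_OF_SOUND


def phase_difference(tdoa_s, freq_hz):
    """Phase difference (radians) that a given delay induces at one frequency."""
    return 2.0 * math.pi * freq_hz * tdoa_s


# Hypothetical microphone pair, 10 cm apart on the x-axis.
mic_a = (-0.05, 0.0, 0.0)
mic_b = (0.05, 0.0, 0.0)

# Two hypothetical speakers with the same azimuth but different
# elevation and distance.
speaker_1 = (30.0, 0.0, 1.0)   # azimuth (deg), elevation (deg), distance (m)
speaker_2 = (30.0, 45.0, 0.5)

tau_1d_s1 = tdoa_azimuth_only(speaker_1[0], mic_a, mic_b)
tau_1d_s2 = tdoa_azimuth_only(speaker_2[0], mic_a, mic_b)
tau_3d_s1 = tdoa_3d(*speaker_1, mic_a, mic_b)
tau_3d_s2 = tdoa_3d(*speaker_2, mic_a, mic_b)

print("azimuth-only delays:", tau_1d_s1, tau_1d_s2)
print("3D delays:          ", tau_3d_s1, tau_3d_s2)
print("3D phase differences at 1 kHz:",
      phase_difference(tau_3d_s1, 1000.0),
      phase_difference(tau_3d_s2, 1000.0))
```

In this toy setup the azimuth-only model assigns both speakers the same delay, so a feature built on it cannot tell them apart, while the delays derived from the 3D positions differ. How the paper turns such location information into network input features, and how the pair-wise attention model weights microphone pairs, is not specified in the abstract.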