Incorporating Lip Features into Audio-Visual Multi-Speaker DOA Estimation by Gated Fusion

Ya Jiang, Hang Chen, Jun Du, Qing Wang, Chin-Hui Lee
DOI: https://doi.org/10.1109/icassp49357.2023.10095549
2023-01-01
Abstract: Audio-visual direction of arrival (DOA) estimation has recently demonstrated superior performance. In this paper, we present a novel audio-visual multi-speaker DOA estimation network that, for the first time, incorporates multi-speaker lip features to adapt to complex overlapping and noisy scenarios. First, we separately encode the multi-channel audio features, the reference angles, and the lip Regions of Interest (RoIs) detected from the video to obtain high-level representations. The multi-modal embeddings of audio, speaker angles, and lips are then fused by a tri-modal gated fusion module that balances their contributions to the output. The fused embedding is passed to the backend network, which produces an accurate DOA estimate by combining the predicted speaker angular vectors with the speaker activities. Experimental results show that the proposed approach reduces the localization error by 73.48% compared to previous work on the 2021 Multi-modal Information based Speech Processing (MISP) Challenge corpus. The high accuracy and stability of the localization results also demonstrate the robustness of the proposed model in multi-speaker scenarios.
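
The abstract does not specify the form of the tri-modal gate, so the following is only an illustrative sketch of one common way such a fusion could work, not the authors' implementation. It assumes all three embeddings share one dimension d, that the gate logits come from a linear projection of the concatenated embeddings, and that a per-dimension softmax across the three modalities yields the mixing weights. The names gated_fusion, init_params, and linear are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_params(d, seed=0):
    """Random (weights, bias) pair per modality; ASSUMED shapes: d x 3d and d."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.1, 0.1) for _ in range(3 * d)] for _ in range(d)],
         [0.0] * d)
        for _ in range(3)
    ]


def gated_fusion(audio, angle, lip, params):
    """Fuse three equal-length embeddings with softmax gates (ASSUMED form).

    Each modality gets a d-dimensional logit vector computed from the
    concatenation of all three embeddings. For every dimension, the three
    logits are normalised with softmax and used as mixing weights.
    """
    concat = audio + angle + lip
    logits = [linear(w, b, concat) for w, b in params]
    fused = []
    for i in range(len(audio)):
        g = softmax([logits[m][i] for m in range(3)])
        fused.append(g[0] * audio[i] + g[1] * angle[i] + g[2] * lip[i])
    return fused


# Toy embeddings standing in for the encoder outputs.
audio_emb = [0.2, -1.0, 0.5, 3.0]
angle_emb = [1.0, 0.0, -0.5, 2.0]
lip_emb = [-0.3, 0.4, 0.9, 1.0]
fused_emb = gated_fusion(audio_emb, angle_emb, lip_emb, init_params(4))
```

Because the gates for each dimension are non-negative and sum to one, every fused value is a convex combination of the three modality values at that dimension. Under this assumed design, that is the sense in which the gate balances the modalities' contributions.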