Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings

Chenyu Yang, Mengxi Chen, Yanfeng Wang, Yu Wang
DOI: https://doi.org/10.1145/3581783.3612424
2023-01-01
Abstract: Audio-visual speaker diarization refers to the task of identifying "who spoke when" by using both audio and video data. Although previous fusion-based approaches have shown exceptional performance over audio-only methods, they have mainly focused on high-quality data and have not accounted for the impacts of acoustic noise or missing faces. To address these limitations, we propose a novel uncertainty-aware end-to-end audio-visual speaker diarization (UAV-SD) approach in this paper. Our approach leverages both framewise inter- and intra-modal confidence to achieve more effective and robust speaker diarization. By taking into account the uncertainty of the data, UAV-SD can achieve better diarization performance even in noisy or low-quality recordings. Additionally, our approach is compatible with multi-channel audio signals without the need to retrain the model, making it a more versatile solution. To evaluate the effectiveness of our approach, we conduct extensive experiments on the Multi-modal Information Based Speech Processing (MISP) 2022 Challenge datasets which consist of far-field audio and video data. The results show that UAV-SD is able to yield significant performance gains compared to baseline methods for both single and multi-channel data, demonstrating its effectiveness in real-world scenarios.
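The abstract does not spell out how the framewise confidences enter the fusion, so the following is only a generic illustration of the idea, not the UAV-SD architecture. It is a minimal sketch of confidence-weighted late fusion, assuming each modality yields per-frame, per-speaker activity probabilities plus a confidence in [0, 1]. The function name `fuse_framewise`, the array shapes, and the fallback value of 0.5 are all hypothetical choices made for this example.

```python
import numpy as np

def fuse_framewise(p_audio, p_video, c_audio, c_video, eps=1e-8):
    """Confidence-weighted average of speaker-activity probabilities.

    Illustrative only (not the paper's model). All inputs have shape
    (frames, speakers); confidences are assumed to lie in [0, 1].
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    c_audio = np.asarray(c_audio, dtype=float)
    c_video = np.asarray(c_video, dtype=float)

    total = c_audio + c_video
    fused = (c_audio * p_audio + c_video * p_video) / np.maximum(total, eps)
    # If neither modality is trusted on a frame, fall back to "undecided".
    return np.where(total > eps, fused, 0.5)

# Two frames, one speaker. In frame 1 the face is missing, so the visual
# confidence is set to 0 and the audio estimate is used on its own.
p_a = [[0.4], [0.8]]
p_v = [[0.9], [0.0]]
c_a = [[0.5], [1.0]]
c_v = [[0.5], [0.0]]
fused = fuse_framewise(p_a, p_v, c_a, c_v)
```

In this toy setup a missing face simply removes the visual stream from that frame's estimate, which is one simple way a system could stay usable when a modality drops out.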