DyViSE: Dynamic Vision-Guided Speaker Embedding for Audio-Visual Speaker Diarization

Abudukelimu Wuerkaixi, Kunda Yan, You Zhang, Zhiyao Duan, Changshui Zhang
DOI: https://doi.org/10.1109/mmsp55362.2022.9948860
2022-01-01
Abstract: Speaker diarization aims to determine “who spoke when” in multi-speaker scenarios. Audio-visual speaker diarization leverages visual information in addition to audio signals and has shown improved performance. Existing audio-visual methods extract speaker embeddings for each video clip using audio and facial features, and then perform clustering according to their similarity. However, this approach would not work well for noisy or overlapped speech where audio features are corrupted, nor for off-screen speakers where visual features are missing. In this work, we propose dynamic vision-guided speaker embedding (DyViSE), a novel method for leveraging visual information to extract speaker embeddings in a multi-stage system. DyViSE uses dynamic lip movement information to denoise audio in a latent space and integrates facial features to obtain an identity-discriminative embedding for each speaking segment. DyViSE is trained with a deep clustering loss along with an exemplary loss. DyViSE demonstrates remarkable performance on both real-world videos and artificially assembled videos. Our code is available at https://github.com/urkax/DyViSE.
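The abstract names a deep clustering loss but does not spell it out. As a rough illustration only, the sketch below implements the commonly used affinity-matching form of that loss, ||VV^T - YY^T||_F^2, where V holds one unit-norm embedding per segment and Y is the one-hot speaker assignment. Whether DyViSE uses exactly this form is an assumption, and the function name `deep_clustering_loss` and the toy inputs are hypothetical, not taken from the paper or its repository.

```python
import numpy as np

def deep_clustering_loss(embeddings, labels):
    """Affinity-matching loss ||V V^T - Y Y^T||_F^2 (assumed form, not confirmed by the paper).

    embeddings: (n_segments, dim) array, one non-zero embedding per speaking segment.
    labels:     (n_segments,) array of speaker identities.
    """
    v = np.asarray(embeddings, dtype=float)
    # L2-normalize rows so V V^T contains cosine similarities.
    v = v / np.linalg.norm(v, axis=1, keepdims=True)

    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One-hot speaker assignment matrix Y of shape (n_segments, n_speakers).
    y = (labels[:, None] == classes[None, :]).astype(float)

    # Expand the Frobenius norm so no (n_segments x n_segments) matrix is built:
    # ||V V^T - Y Y^T||_F^2 = ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2
    vtv = v.T @ v
    vty = v.T @ y
    yty = y.T @ y
    return float((vtv ** 2).sum() - 2.0 * (vty ** 2).sum() + (yty ** 2).sum())

# Toy usage: two segments from speaker 0, one from speaker 1.
well_separated = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
speakers = [0, 0, 1]
loss_good = deep_clustering_loss(well_separated, speakers)
loss_bad = deep_clustering_loss(collapsed, speakers)
```

The loss is zero when same-speaker segments share an embedding and different speakers are orthogonal, and it grows as embeddings of different speakers become similar, which is the property that makes the subsequent similarity-based clustering step workable.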