Weakly Supervised Target-Speaker Voice Activity Detection

Zixin Zhao, Lan Zhang
DOI: https://doi.org/10.1109/bigcom61073.2023.00041
2023-01-01
Abstract: Target Speaker Voice Activity Detection (TS-VAD) is a widely used technique for detecting the voice of a target speaker in an input audio stream. However, training a TS-VAD model requires accurate frame-level labels indicating the temporal localization of the target speaker, which are labor-intensive for human annotators to produce, especially when the input audio contains overlapping segments. We investigate how to train TS-VAD with clip-level labels, which indicate only the presence or absence of the target speaker’s voice in the audio stream, without accurate temporal duration information. This problem falls under the category of weakly supervised learning; however, we find that multiple instance learning, a popular weakly supervised learning framework, is not an effective solution for weakly supervised TS-VAD. In this work, we propose a novel weakly supervised training method for TS-VAD that exploits the correlation between frame-level decisions and clip-level labels. Our method uses the frame-level decisions as weights on the frame features of the input audio and extracts a speaker embedding from the weighted features. The model is optimized to minimize the loss between the speaker embedding similarity and the clip-level label. Experiments show that our weakly supervised TS-VAD achieves 18.3% Event-F1, whereas the existing weakly supervised method achieves only 5.8%.
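The training signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the pooling rule, the similarity measure, or the loss, so the weighted-mean pooling, cosine similarity, and binary cross-entropy used here are assumptions, and all function names are hypothetical. The sketch only shows how frame-level decisions can act as weights so that a clip-level label alone produces a usable loss.

```python
import math


def weighted_embedding(frame_features, frame_probs, eps=1e-8):
    """Pool frame features into one vector, weighting each frame by its
    frame-level decision (assumed here to be a probability in [0, 1])."""
    dim = len(frame_features[0])
    total = sum(frame_probs) + eps
    return [
        sum(p * f[d] for p, f in zip(frame_probs, frame_features)) / total
        for d in range(dim)
    ]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def clip_level_loss(frame_features, frame_probs, target_embedding, clip_label):
    """Binary cross-entropy between the embedding similarity and the clip
    label (1 = target speaker present, 0 = absent). The loss choice is an
    assumption; the abstract only says a loss between the two is minimized."""
    pooled = weighted_embedding(frame_features, frame_probs)
    sim = cosine_similarity(pooled, target_embedding)
    # Map similarity from [-1, 1] to (0, 1) and clamp to keep log() finite.
    score = min(max((sim + 1.0) / 2.0, 1e-7), 1.0 - 1e-7)
    return -(clip_label * math.log(score)
             + (1 - clip_label) * math.log(1.0 - score))


# Toy clip: the first two frames resemble the target, the last two do not.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
target = [1.0, 0.0]

good_decisions = [1.0, 1.0, 0.0, 0.0]  # selects the target-like frames
bad_decisions = [0.0, 0.0, 1.0, 1.0]   # selects the other frames

loss_good = clip_level_loss(features, good_decisions, target, clip_label=1)
loss_bad = clip_level_loss(features, bad_decisions, target, clip_label=1)
```

In this toy case, frame decisions that select the target-like frames yield a lower loss than decisions that select the other frames, which is the property that lets a clip-level label guide frame-level predictions. In a real system the features and decisions would come from trainable networks and the loss would be minimized by gradient descent.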