Abstract: Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline, for example by using a hidden Markov model. These ASR models are commonly trained on clean and fully transcribed data, which restricts VAD training to clean or synthetically noised datasets. A major challenge for supervised VAD systems is therefore their generalization to noisy, real-world data. This work proposes a data-driven teacher-student approach for VAD that utilizes vast and unconstrained audio data for training. Unlike previous approaches, it requires only weak labels during teacher training, enabling the use of any real-world, potentially noisy dataset. Our approach first trains a teacher model on a source dataset (Audioset) using clip-level supervision. After training, the teacher provides frame-level guidance to a student model on an unlabeled target dataset. We investigate a multitude of student models trained on mid- to large-sized datasets (Audioset, Voxceleb, NIST SRE). Each is evaluated on clean, artificially noised, and real-world data. We observe significant performance gains in the artificially noised and real-world scenarios. Lastly, we compare our approach against other unsupervised and supervised VAD methods, demonstrating our method's superiority.
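The two training stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, mean pooling is an assumed way to turn frame-level outputs into a clip-level score, and binary cross-entropy is an assumed loss. The abstract specifies none of these. The sketch only shows where the supervision comes from in each stage: one weak label per clip for the teacher, and the teacher's per-frame predictions for the student.

```python
import math


def clip_level_probability(frame_probs):
    """Pool frame-level speech probabilities into one clip-level score.

    Mean pooling is an assumption made for this sketch.
    """
    return sum(frame_probs) / len(frame_probs)


def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard (0/1) or soft."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(predictions)


def teacher_loss(teacher_frame_probs, clip_label):
    """Stage 1: the teacher is supervised by a single weak clip-level label."""
    pooled = clip_level_probability(teacher_frame_probs)
    return binary_cross_entropy([pooled], [clip_label])


def student_loss(student_frame_probs, teacher_frame_probs):
    """Stage 2: the student is supervised per frame by the teacher's
    predictions on unlabeled target audio."""
    return binary_cross_entropy(student_frame_probs, teacher_frame_probs)
```

In this sketch the teacher never sees frame-level labels, yet its frame-level outputs become the training targets for the student, which is what lets the student train on a dataset with no labels at all.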

Weakly Supervised Target-Speaker Voice Activity Detection

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave

Voice activity detection in the wild: A data-driven approach using teacher-student training

Voice Activity Detection in the Wild Via Weakly Supervised Sound Event Detection

SVVAD: Personal Voice Activity Detection for Speaker Verification

Long-Short Temporal Co-Teaching for Weakly Supervised Video Anomaly Detection

Personal VAD: Speaker-Conditioned Voice Activity Detection

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

End-to-End Speaker-Dependent Voice Activity Detection

Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization

Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker

Profile-Error-Tolerant Target-Speaker Voice Activity Detection

Joint framework with deep feature distillation and adaptive focal loss for weakly supervised audio tagging and acoustic event detection

Voice activity detection using a local-global attention model

Towards Weakly Supervised Text-to-Audio Grounding

Target Active Speaker Detection with Audio-visual Cues

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection

Multi-task Joint-Learning for Robust Voice Activity Detection

Weakly Supervised Temporal Adjacent Network for Language Grounding