Abstract: Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline, e.g., via a hidden Markov model. These ASR models are commonly trained on clean, fully transcribed data, which restricts VAD training to clean or synthetically noised datasets. A major challenge for supervised VAD systems is therefore their generalization to noisy, real-world data. This work proposes a data-driven teacher-student approach for VAD that utilizes vast, unconstrained audio data for training. Unlike previous approaches, it requires only weak labels during teacher training, which enables the use of any real-world, potentially noisy dataset. Our approach first trains a teacher model on a source dataset (Audioset) using clip-level supervision. After training, the teacher provides frame-level guidance to a student model on an unlabeled target dataset. We investigate a multitude of student models trained on mid- to large-sized datasets (Audioset, Voxceleb, NIST SRE). Our approach is evaluated on clean, artificially noised, and real-world data, respectively. We observe significant performance gains in the artificially noised and real-world scenarios. Lastly, we compare our approach against other unsupervised and supervised VAD methods and find that it outperforms them.
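
The two training stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states only that the teacher uses clip-level supervision and that the student receives frame-level guidance. The pooling function (linear softmax pooling), the binary cross-entropy loss, and all function names below are assumptions chosen for illustration.

```python
import math

EPS = 1e-7  # numerical guard for log() and division


def linear_softmax_pool(frame_probs):
    """Aggregate frame-level speech probabilities into one clip-level value.

    Assumption: linear softmax pooling, sum(p^2) / sum(p). The abstract
    does not name the pooling function.
    """
    num = sum(p * p for p in frame_probs)
    den = sum(frame_probs) + EPS
    return num / den


def bce(preds, targets):
    """Mean binary cross-entropy; targets may be soft values in [0, 1]."""
    total = 0.0
    for p, t in zip(preds, targets):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(preds)


def teacher_loss(teacher_frame_probs, clip_label):
    """Stage 1 (hypothetical): only a weak, clip-level label is available."""
    clip_prob = linear_softmax_pool(teacher_frame_probs)
    return bce([clip_prob], [clip_label])


def student_loss(student_frame_probs, teacher_frame_probs):
    """Stage 2 (hypothetical): the teacher's frame outputs act as soft labels."""
    return bce(student_frame_probs, teacher_frame_probs)


# Toy clip with four frames: speech in the first two, none in the last two.
teacher_frames = [0.9, 0.8, 0.1, 0.05]
good_student = [0.85, 0.75, 0.15, 0.1]
bad_student = [0.1, 0.2, 0.9, 0.95]

print(teacher_loss(teacher_frames, 1.0))
print(student_loss(good_student, teacher_frames))
print(student_loss(bad_student, teacher_frames))
```

In this sketch the student never sees a ground-truth label. It is penalized only for disagreeing with the teacher's per-frame outputs, which is why the target dataset can remain unlabeled.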

SAPVAD: An Efficient Voice Activity Detection Model Based on Spectral Attention and Parallel Structure.

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave.

Applying Support Vector Machines to Voice Activity Detection

SVVAD: Personal Voice Activity Detection for Speaker Verification

Voice activity detection using a local-global attention model

Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability

sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks

Multimodal Voice Activity Detection

Personal VAD: Speaker-Conditioned Voice Activity Detection

An efficient voice activity detection algorithm by combining statistical model and energy detection

A Universal VAD Based on Jointly Trained Deep Neural Networks.

Improved Voice Activity Detection Based on Long-term Spectral Divergence and Pitch Ratio Features

Advancing VAD Systems Based on Multi-Task Learning with Improved Model Structures

Speech enhancement aided end-to-end multi-task learning for voice activity detection

Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction

Voice Activity Detection (VAD) in Noisy Environments

Voice activity detection in the wild: A data-driven approach using teacher-student training

Robust Voice Activity Detection based on Pitch and Sub-band Energy

Waveform-based Voice Activity Detection Exploiting Fully Convolutional networks with Multi-Branched Encoders

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Voice Activity Detection Using Wavelets Multiresolution Spectrum and Short-time Adaptive Audio Mixing Algorithm