Abstract: Voice activity detection (VAD) is an important topic in audio signal processing. Contextual information is important for improving the performance of VAD at low signal-to-noise ratios. Here we explore contextual information by machine learning methods at three levels. At the top level, we employ an ensemble learning framework, named multi-resolution stacking (MRS), which is a stack of ensemble classifiers. Each classifier in a building block takes as input the concatenation of the predictions of its lower building blocks and the expansion of the raw acoustic feature by a given window (called a resolution). At the middle level, we describe a base classifier in MRS, named boosted deep neural network (bDNN). bDNN first generates multiple base predictions from different contexts of a single frame using only one DNN and then aggregates the base predictions into a better prediction for the frame. This distinguishes it from computationally expensive boosting methods, which train ensembles of classifiers to obtain multiple base predictions. At the bottom level, we employ the multi-resolution cochleagram feature, which incorporates contextual information by concatenating the cochleagram features at multiple spectrotemporal resolutions. Experimental results show that the MRS-based VAD outperforms other VADs by a considerable margin. Moreover, when trained on a large number of noise types and a wide range of signal-to-noise ratios, the MRS-based VAD demonstrates surprisingly good generalization performance on unseen test scenarios, approaching the performance with noise-dependent training.
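
The aggregation step of bDNN can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: each frame is represented by a single scalar feature, the context window is symmetric with a hypothetical `half_width`, the single model (`predict_window`, a stand-in for the DNN) returns one speech score per frame in its input window, and the base predictions are aggregated by simple averaging. The function names are illustrative only.

```python
from typing import Callable, List, Sequence


def aggregate_context_predictions(
    features: Sequence[float],
    predict_window: Callable[[List[float]], List[float]],
    half_width: int,
) -> List[float]:
    """Average the scores that one model assigns to each frame
    across all context windows that contain that frame."""
    n = len(features)
    offsets = list(range(-half_width, half_width + 1))
    sums = [0.0] * n
    counts = [0] * n

    for center in range(n):
        # Clamp indices at the signal edges so every window has equal length.
        indices = [min(max(center + o, 0), n - 1) for o in offsets]
        window = [features[i] for i in indices]

        # One model call yields a score for every position in the window.
        scores = predict_window(window)

        for o, score in zip(offsets, scores):
            t = center + o
            if 0 <= t < n:  # ignore padded positions outside the signal
                sums[t] += score
                counts[t] += 1

    # Every frame is covered at least once (by its own window, offset 0).
    return [s / c for s, c in zip(sums, counts)]


def decide(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Turn aggregated scores into speech / non-speech decisions."""
    return [s >= threshold for s in scores]


if __name__ == "__main__":
    # Hypothetical stand-in for the DNN: clip each feature to [0, 1].
    def toy_model(window: List[float]) -> List[float]:
        return [min(1.0, max(0.0, x)) for x in window]

    frame_features = [0.1, 0.8, 0.9, 0.2, 0.05]
    aggregated = aggregate_context_predictions(frame_features, toy_model, half_width=1)
    print(decide(aggregated))
```

With `half_width = 1`, an interior frame receives three base predictions (from the windows centered on its left neighbor, on itself, and on its right neighbor), all produced by the same model, so no ensemble of separately trained classifiers is needed.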

Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection
