Abstract: Voice Activity Detection (VAD) is a crucial component of Speech Enhancement (SE) because it determines how accurately noise is estimated, which directly affects how much SE improves speech quality. However, conventional non-data-driven VADs often lose accuracy at low signal-to-noise ratios (SNRs). To address this issue, this study proposes a multi-feature, cosine similarity-based, multi-observation VAD algorithm (mVAD). The algorithm selects noise-robust features, with Mel-frequency Cepstral Coefficients (MFCCs) as the main features, and uses several optimization techniques and an adaptive threshold for background noise updating. The soft VAD results are then smoothed with an improved exponential moving average (EMA) algorithm, and a shifting window tracks the mean value to obtain an adaptive threshold for converting the soft results to binary ones. Experimental results indicate that mVAD maintains high classification accuracy down to -10 dB, with an improvement of approximately 28%, while also being computationally efficient in CPU time (about 1/3 of that of statistical model-based methods). It also remains robust at SNRs below 0 dB (Δ ≤ 2.1%). Moreover, in some cases mVAD achieves higher accuracy than deep learning-based VADs. To further demonstrate the effectiveness of the proposed method, the VAD results are used as an additional feature to train and test a neural network (NN)-based SE model, which improves SE performance. This study shows that mVAD does not rely on prior noise knowledge and achieves both reduced complexity and improved accuracy for speech enhancement, making it a promising approach for robust VAD in low-SNR environments.
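
As a rough illustration of the pipeline outlined in the abstract (a cosine similarity-based soft score, EMA smoothing, and a window-mean adaptive threshold), the following minimal Python sketch may help. It is not the authors' implementation. The function names, the noise template, the use of `1 - cosine similarity` as the soft score, the plain (not "improved") EMA, and the parameters `alpha`, `window`, and `margin` are all assumptions made for illustration. The feature vectors stand in for MFCC frames.

```python
import math
from collections import deque


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def soft_vad(frames, noise_template):
    """Assumed soft score: frames that resemble the noise template score low."""
    return [1.0 - cosine_similarity(f, noise_template) for f in frames]


def ema_smooth(scores, alpha=0.3):
    """Plain exponential moving average (the paper's improved variant is not specified here)."""
    smoothed = []
    prev = None
    for s in scores:
        prev = s if prev is None else alpha * s + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def adaptive_binarize(soft, window=5, margin=0.0):
    """Mark a frame as speech (1) when its score exceeds the mean of the
    last `window` scores (current frame included) plus `margin`."""
    recent = deque(maxlen=window)
    decisions = []
    for s in soft:
        recent.append(s)
        threshold = sum(recent) / len(recent) + margin
        decisions.append(1 if s > threshold else 0)
    return decisions


# Toy usage: 2-D vectors stand in for MFCC frames (hypothetical data).
noise = [1.0, 0.0]
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0], [1.0, 0.0]]
decisions = adaptive_binarize(ema_smooth(soft_vad(frames, noise), alpha=0.5), window=3)
```

In this sketch the window has to be long enough to span both speech and non-speech frames. If it covers only speech, the running mean rises to the speech level and the threshold stops separating the two classes.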

A Robust and Lightweight Voice Activity Detection Algorithm for Speech Enhancement at Low Signal-to-noise Ratio
