Abstract: Voice Activity Detection (VAD) is a crucial component of Speech Enhancement (SE) because it enables accurate noise estimation, which directly affects how effectively SE improves speech quality. However, conventional non-data-driven VADs often suffer from decreased accuracy at low signal-to-noise ratios (SNRs). To address this issue, this study proposes a multi-feature, cosine similarity-based multi-observation VAD algorithm (mVAD). The algorithm selects noise-robust features, with Mel-frequency Cepstral Coefficients (MFCCs) as the main features, and uses several optimization techniques and an adaptive threshold for background noise updating. The soft VAD results are then smoothed with an improved exponential moving average (EMA) algorithm, and a shifting window tracks the mean value to obtain an adaptive threshold for converting the soft results to binary ones. Experimental results indicate that mVAD maintains high classification accuracy down to -10 dB, with an increase of approximately 28%, while also being computationally efficient in CPU time (about 1/3 of that of statistical model-based methods). It also remains robust at SNRs below 0 dB (Δ ≤ 2.1%). Moreover, in some cases mVAD achieves higher accuracy than deep learning-based VADs. To further demonstrate the effectiveness of the proposed method, the VAD results are used as an additional feature to train and test a neural network (NN)-based SE model, which enhances the SE performance. This study shows that mVAD does not rely on prior noise knowledge and achieves both reduced complexity and improved accuracy for speech enhancement, making it a promising approach for robust VAD in low-SNR environments.
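The post-processing chain described in the abstract (a cosine similarity-based soft score, EMA smoothing, and a shifting-window mean used as an adaptive threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the "improved" EMA, the exact threshold rule, or the noise-update logic, so a plain EMA and a plain running mean are substituted here. The function names and the parameters `alpha`, `window`, and `margin` are hypothetical, and the frames are assumed to be precomputed feature vectors (for example MFCCs).

```python
import math
from collections import deque


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def soft_scores(frames, noise_ref):
    """Soft VAD score per frame: dissimilarity to an assumed noise reference.

    Higher values mean the frame looks less like the noise estimate.
    """
    return [1.0 - cosine_similarity(frame, noise_ref) for frame in frames]


def ema_smooth(scores, alpha=0.5):
    """Plain exponential moving average (stand-in for the paper's improved EMA)."""
    smoothed = []
    prev = None
    for s in scores:
        prev = s if prev is None else alpha * s + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def binarize(scores, window=4, margin=0.0):
    """Convert soft scores to 0/1 using the mean over a shifting window.

    The threshold for each frame is the mean of the last `window` scores
    (including the current one) plus a hypothetical `margin`.
    """
    recent = deque(maxlen=window)
    decisions = []
    for s in scores:
        recent.append(s)
        threshold = sum(recent) / len(recent) + margin
        decisions.append(1 if s > threshold else 0)
    return decisions
```

A typical call chain would be `binarize(ema_smooth(soft_scores(frames, noise_ref)))`. Because the threshold here is just a running mean, a short window tends to switch off during long stretches of speech once the window fills with speech frames; a practical system would need a longer window or an additional rule, which the abstract does not detail.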

Towards Improving Statistical Model Based Voice Activity Detection

A Novel and Efficient Voice Activity Detector Using Shape Features of Speech Wave

Applying Support Vector Machines to Voice Activity Detection

An improved noise-robust voice activity detector based on hidden semi-Markov models

Improved voice activity detection based on statistical likelihood ratio test

An efficient voice activity detection algorithm by combining statistical model and energy detection

Computational Auditory Scene Analysis Based Voice Activity Detection

Improved Voice Activity Detection Based on Long-term Spectral Divergence and Pitch Ratio Features

Sparse Power Spectrum Based Robust Voice Activity Detector

Voice Activity Detection Based on Complex Exponential Atomic Decomposition and Likelihood Ratio Test

Voice Activity Detection Based on Conjugate Subspace Matching Pursuit and Likelihood Ratio Test

A Robust and Lightweight Voice Activity Detection Algorithm for Speech Enhancement at Low Signal-to-noise Ratio

Multimodal Voice Activity Detection

Improving Voice Activity Detection Via Weighting Likelihood and Dimension Reduction

Combining Sub-bands SNR on Cochlear Model for Voice Activity Detection

Efficient voice activity detection algorithm based on sub-band temporal envelope and sub-band long-term signal variability

A Voice Activity Detection Method Based on DWT-MVNPDF

A robust voice activity detector based on Weibull and Gaussian Mixture distribution

A Feature Parameter Modification Algorithm for Voice Activity Detection Based on Support Vector Machine

Statistical Voice Activity Detection Based on Sparse Representation over Learned Dictionary

Sparse Representation with Optimized Learned Dictionary for Robust Voice Activity Detection