Voice activity detection based on sequential Gaussian mixture model with maximum likelihood criterion

Zhan Shen, Jianguo Wei, Wenhuan Lu, Jianwu Dang
DOI: https://doi.org/10.1109/ISCSLP.2016.7918417
2016-01-01
Abstract: Voice activity detection is a binary classification task that partitions the frame sequence or frequency bins into speech/nonspeech clusters in an on-line manner. The Gaussian model has conventionally been employed to describe the probability density function of speech/nonspeech signals, with classification based on likelihood. However, the conventional technique has not been unified into a theoretical framework that enables optimal classification. This paper uses a sequential Gaussian mixture model (GMM) to model the logarithmic power sequence in each frequency band. A sequential likelihood function is presented to estimate the parameter set of this GMM frame by frame. The likelihood function is sequentially maximized with the iterative Newton-Raphson algorithm, and the on-line estimation is expressed as a first-order regression. Finally, the power sequence is classified into speech/nonspeech according to the maximum likelihood criterion. Experimental results confirmed the superiority of the proposed method.
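The frame-by-frame idea in the abstract can be illustrated with a minimal sketch for a single frequency band. This is not the paper's algorithm: the abstract does not give the update equations, so the sketch swaps in a generic posterior-weighted recursive update (a first-order recursion with a fixed step size) in place of the Newton-Raphson derivation. The class name `SequentialGMMVad`, the step size `alpha`, the variance and weight floors, and the rule that the component with the larger mean is "speech" are all assumptions made for illustration.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


class SequentialGMMVad:
    """Two-component GMM on the log power of one band (illustrative only)."""

    def __init__(self, noise_mean, speech_mean, var=1.0, alpha=0.05,
                 min_var=1e-3, min_weight=1e-6):
        self.means = [noise_mean, speech_mean]
        self.vars = [var, var]
        self.weights = [0.5, 0.5]
        self.alpha = alpha            # assumed fixed step size
        self.min_var = min_var        # floor to keep variances positive
        self.min_weight = min_weight  # floor to keep log(weight) finite

    def step(self, x):
        """Classify one log-power value, then update the model on it."""
        # Weighted log-likelihood of x under each component.
        log_joint = [
            math.log(w) + gaussian_logpdf(x, m, v)
            for w, m, v in zip(self.weights, self.means, self.vars)
        ]
        # Normalize with log-sum-exp to get component posteriors.
        top = max(log_joint)
        log_norm = top + math.log(sum(math.exp(l - top) for l in log_joint))
        post = [math.exp(l - log_norm) for l in log_joint]

        # Assumption: the component with the larger mean models speech.
        speech = 0 if self.means[0] > self.means[1] else 1
        is_speech = post[speech] > post[1 - speech]

        # First-order recursive update, weighted by the posterior.
        for k in range(2):
            rate = self.alpha * post[k]
            delta = x - self.means[k]
            self.means[k] += rate * delta
            self.vars[k] = max(
                self.min_var,
                (1.0 - rate) * self.vars[k] + rate * delta * delta,
            )
            self.weights[k] = max(
                self.min_weight,
                (1.0 - self.alpha) * self.weights[k] + self.alpha * post[k],
            )
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]
        return is_speech


if __name__ == "__main__":
    vad = SequentialGMMVad(noise_mean=-4.0, speech_mean=4.0)
    frames = [-5.0, -5.2, -4.8, 5.1, 4.9, -5.0]  # made-up log-power values
    print([vad.step(x) for x in frames])
```

Each call to `step` makes the decision with the current parameters and only then adapts them, so the model tracks slow changes in the noise floor without needing a stored history of frames. A full detector would run one such model per frequency band and combine the band decisions, which the sketch leaves out.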