Multimodal Voice Activity Detection

LIU Peng, WANG Zuoying
DOI: https://doi.org/10.3321/j.issn:1000-0054.2005.07.009
2005-01-01
Abstract: In speech recognition systems, voice activity detection (VAD) based on frame energy can be degraded by interference from background noise and by the non-stationary behavior of the frame energy within voiced segments. This paper presents a model that improves the performance and robustness of VAD by introducing visual information. Data-driven linear transformations are used for visual feature extraction, combined with a general statistical VAD model and a two-stage fusion strategy in a multimodal VAD system. Experiments show a 55.0% reduction in the frame error rate and a 98.5% reduction in the sentence breaking error rate with the multimodal VAD compared with the frame energy-based audio VAD. The results show that the multimodal method eliminates most sentence breaking errors and improves frame detection performance.
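For readers unfamiliar with the audio-only baseline named in the abstract, the following is a minimal sketch of a frame energy-based VAD. It is not the authors' implementation: the frame length, the fixed decibel threshold, the synthetic test signal, and the function names (`frame_energies`, `energy_vad`, `demo_signal`) are all assumptions made for illustration, and the paper's statistical VAD model, visual features, and two-stage fusion are not reproduced here.

```python
import math
import random


def frame_energies(samples, frame_len=160):
    """Return the log energy (dB) of each non-overlapping frame.

    frame_len=160 is an assumed value (20 ms at 8 kHz), not taken from the paper.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on all-zero frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Mark a frame as speech when its log energy exceeds a fixed threshold.

    The fixed threshold is an illustrative assumption; practical systems
    usually adapt it to the estimated noise floor.
    """
    return [e > threshold_db for e in frame_energies(samples, frame_len)]


def demo_signal():
    """Synthetic signal: low-level noise, a 200 Hz tone, then noise again."""
    rng = random.Random(0)
    silence_a = [rng.gauss(0.0, 0.001) for _ in range(800)]
    tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
    silence_b = [rng.gauss(0.0, 0.001) for _ in range(800)]
    return silence_a + tone + silence_b


decisions = energy_vad(demo_signal())
print(decisions)
```

On this clean synthetic signal the five middle frames are flagged as speech and the rest as silence. Raising the noise level toward the tone level makes the fixed threshold unreliable, which is the kind of weakness the abstract attributes to energy-only detection.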