Fusing audio and visual features of speech

Hao Pan, Zhi-pei Liang, Thomas S. Huang
DOI: https://doi.org/10.1109/ICIP.2000.899333
2000-01-01
ICIP
Abstract: In this paper, the audio and visual features of speech are integrated using a novel fused-HMM. We assume that the two sets of features may have different data rates and durations. Hidden Markov models (HMMs) are first used to model them separately, and then a general Bayesian fusion method, which is optimal in the maximum entropy sense, is employed to fuse them together. In particular, an efficient learning algorithm is introduced. Instead of maximizing the joint likelihood of the fused-HMM, the learning algorithm maximizes the likelihoods of the two HMMs separately, and then fuses the HMMs together. In addition, an inference algorithm is proposed. We have tested the proposed method in person verification experiments. Results show that the proposed method significantly reduces the recognition error rates compared to the unimodal HMMs and the loosely-coupled fusion model.