Aligning Audiovisual Features for Audiovisual Speech Recognition.

Fei Tao, Carlos Busso
DOI: https://doi.org/10.1109/icme.2018.8486455
2018-01-01
Abstract: Visual information can improve the performance of automatic speech recognition (ASR), especially in the presence of background noise or different speech modes. A key problem is how to fuse the acoustic and visual features so as to leverage their complementary information and overcome the alignment differences between modalities. Current audiovisual ASR (AV-ASR) systems rely on linear interpolation or extrapolation as a pre-processing technique to align audio and visual features, assuming that the feature sequences are aligned frame-by-frame. These pre-processing methods oversimplify the phase difference between lip motion and speech, lacking flexibility and impairing the performance of the system. This paper addresses the fusion of audiovisual features with an alignment neural network (AliNN), relying on a recurrent neural network (RNN) with an attention model. The proposed front-end model can automatically learn the alignment from the data. The resulting aligned features are concatenated and fed to conventional back-end ASR systems. The proposed front-end system is evaluated with matched and mismatched channel conditions, under clean and noisy recordings. The results show that the proposed approach outperforms the baseline, with relative improvements of 24.9% with a Gaussian mixture model with hidden Markov model (GMM-HMM) back-end and 2.4% with a deep neural network with hidden Markov model (DNN-HMM) back-end.
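To make the baseline pre-processing concrete, the following is a minimal sketch of linear-interpolation alignment followed by feature concatenation. It is not the authors' code. The frame rates (100 audio frames per second, 30 video frames per second), the feature dimensions, and the function names `interpolate_visual` and `fuse` are assumptions chosen for illustration. Audio frames that fall after the last visual frame are clamped to the last visual value here, rather than extrapolated.

```python
import numpy as np


def interpolate_visual(visual, visual_fps, audio_fps, num_audio_frames):
    """Linearly resample visual features of shape (T_v, D_v) onto the audio frame timeline."""
    t_visual = np.arange(visual.shape[0]) / visual_fps   # timestamps of visual frames (s)
    t_audio = np.arange(num_audio_frames) / audio_fps    # timestamps of audio frames (s)
    # np.interp handles 1-D data, so each feature dimension is resampled separately.
    # Query times beyond the last visual timestamp are clamped to the last value.
    columns = [np.interp(t_audio, t_visual, visual[:, d]) for d in range(visual.shape[1])]
    return np.stack(columns, axis=1)


def fuse(audio, visual, audio_fps=100.0, visual_fps=30.0):
    """Frame-by-frame fusion: resample visual features, then concatenate with audio."""
    aligned = interpolate_visual(visual, visual_fps, audio_fps, audio.shape[0])
    return np.concatenate([audio, aligned], axis=1)


# Hypothetical one-second utterance: 13-dim acoustic and 5-dim visual features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((100, 13))
visual = rng.standard_normal((30, 5))
fused = fuse(audio, visual)
print(fused.shape)
```

This scheme fixes the audio-to-visual correspondence by timestamp alone, which is the frame-by-frame assumption the abstract criticizes; a learned alignment would instead let each audio frame weight visual frames according to the data.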