An Experimental Analysis of Different Approaches to Audio–Visual Speech Recognition and Lip-Reading

Denis Ivanko, Dmitry Ryumin, Alexey Karpov
DOI: https://doi.org/10.1007/978-981-15-5580-0_16
2020-09-02
Abstract: In this paper, we analyze different approaches to audio–visual speech recognition (AVSR). We focus mainly on testing different modality fusion techniques rather than other parts of AVSR (e.g., feature extraction methods). Three audio–visual modality integration methods were considered, namely GMM-CHMM, DNN-HMM, and end-to-end approaches, identified as the most promising and the most commonly found in the scientific literature. Testing was performed on two datasets: the GRID corpus for English and the HAVRUS corpus for Russian. The results once again confirm the superiority of neural network approaches over the others when enough data is available to train NN models effectively, as demonstrated by our experiments on the GRID dataset. On the smaller HAVRUS database, the best recognition results were achieved by the traditional GMM-CHMM approach. This paper presents our view of the current state of the audio–visual speech recognition field and possible directions for further research.