HMM-based Lip Reading with Stingy Residual 3D Convolution

Qifeng Zeng, Jun Du, Zirui Wang
2021-01-01
Abstract: In this paper, we propose a novel approach to sentence-level lip reading using a hidden Markov model (HMM) framework. To calculate the posterior probabilities of the HMM states, we design an architecture consisting of a convolutional neural network (CNN) based visual module followed by multi-head self-attention Transformers. 3D convolution has recently become popular in visual modules for extracting temporal features in lip-reading tasks; it achieves higher accuracy than 2D convolution, but at the cost of more computation. This motivates us to design a plug-and-play compact 3D convolution unit called "Stingy Residual 3D" (StiRes3D). We use heterogeneous convolution kernels for different input channels, and apply channel-wise and point-wise convolutions to make the block compact. Evaluated on the Lip Reading Sentences 2 (LRS2-BBC) dataset, we first demonstrate that our HMM-based approach outperforms a connectionist temporal classification (CTC) based approach with the same visual module and Transformer architecture, yielding a word error rate reduction of 1.9%. We then show empirically that the proposed approach with a StiRes3D-based visual module achieves clear improvements in both recognition accuracy and model efficiency over the Pseudo 3D network, which has a compact 3D convolution design. Our approach also outperforms the current state-of-the-art approach, with a word error rate reduction of 1.5%.
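To give a rough sense of why channel-wise and point-wise convolutions make a 3D block compact, the sketch below compares weight counts for a standard 3D convolution and a generic channel-wise plus point-wise factorization. It is not the StiRes3D block itself: the function names, the 64-channel width, and the 3x3x3 kernel are illustrative assumptions, bias terms are ignored, and the heterogeneous kernels and residual connections mentioned in the abstract are not modeled.

```python
def conv3d_params(c_in, c_out, kernel=(3, 3, 3)):
    """Weight count of a standard 3D convolution (bias ignored)."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw


def separable_conv3d_params(c_in, c_out, kernel=(3, 3, 3)):
    """Weight count of a channel-wise 3D convolution followed by a
    point-wise (1x1x1) convolution (bias ignored).

    This is the generic depthwise-separable factorization, used here
    only as an assumption-laden stand-in for a compact 3D unit.
    """
    kt, kh, kw = kernel
    channel_wise = c_in * kt * kh * kw  # one spatio-temporal filter per input channel
    point_wise = c_in * c_out           # 1x1x1 convolution that mixes channels
    return channel_wise + point_wise


# Hypothetical layer width chosen purely for illustration.
standard = conv3d_params(64, 64)
compact = separable_conv3d_params(64, 64)
print(f"standard: {standard} weights, factorized: {compact} weights")
```

Under these assumptions the factorized layer needs far fewer weights than the standard one, which is the kind of saving that compact 3D designs aim for.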