Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee
DOI: https://doi.org/10.21437/interspeech.2021-723
2021-01-01
Abstract: In this paper, we propose a novel deep learning architecture for improving word-level lip-reading. We first incorporate multiscale processing into spatial feature extraction for lip-reading using hierarchical pyramidal convolution (HPConv) and self-attention. Specifically, HPConv is proposed to replace conventional convolutional features, improving the model’s ability to discover fine-grained lip movements. Next, to deal with fixed-length image sequences representing words in a given database, a self-attention mechanism is proposed to integrate local information across all lip frames without assuming known word boundaries, so that our deep models automatically utilize key features in the relevant frames of a given word. Experiments on the Lip Reading in the Wild corpus show that our proposed architecture achieves an accuracy of 86.83%, a relative error rate reduction of about 10% over a state-of-the-art scheme that averages frame scores for information fusion. A detailed analysis of the experimental results also confirms that the weights learned by self-attention tend to be zero at both ends of an image sequence and non-zero in the middle part of a given word.
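The contrast the abstract draws between averaging frame scores and attention-based fusion can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scaled dot-product scoring, the single learned `query` vector, the function names, and the sizes (29 frames, 16 feature dimensions, 5 classes) are all assumptions chosen for illustration, and the random arrays stand in for real per-frame features and class scores.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

def mean_fusion(frame_scores):
    """Baseline: every frame contributes equally to the word-level score."""
    return frame_scores.mean(axis=0)

def attention_fusion(frame_features, frame_scores, query):
    """Hypothetical attention pooling: weight each frame by its relevance.

    frame_features: (T, D) per-frame feature vectors
    frame_scores:   (T, C) per-frame class scores
    query:          (D,) vector that would be learned in a real model (assumed)
    """
    dim = frame_features.shape[1]
    logits = frame_features @ query / np.sqrt(dim)  # one relevance logit per frame
    weights = softmax(logits)                       # non-negative, sums to 1
    fused = weights @ frame_scores                  # weighted sum over frames
    return fused, weights

# Illustrative sizes only (assumed, not taken from the paper).
T, D, C = 29, 16, 5
rng = np.random.default_rng(0)
features = rng.normal(size=(T, D))
scores = rng.normal(size=(T, C))
query = rng.normal(size=D)

fused, weights = attention_fusion(features, scores, query)
baseline = mean_fusion(scores)
```

With an all-zero `query` the weights become uniform and the result reduces to plain averaging, which makes averaging a special case of this kind of pooling. Weights near zero at the two ends of the sequence, as reported in the abstract, would correspond to frames outside the target word contributing almost nothing to the fused score.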