Abstract: As an alternative approach, viseme-based lipreading systems have demonstrated promising performance in decoding videos of people uttering entire sentences. However, the overall performance of such systems is significantly affected by how efficiently visemes are converted to words during the lipreading process. As shown in the literature, this conversion has become a bottleneck: a system's performance can drop dramatically from a high classification accuracy for visemes (e.g., over 90%) to a comparatively low classification accuracy for words (e.g., only just over 60%). The underlying cause of this phenomenon is that roughly half of the words in the English language are homophemes, i.e., the same sequence of visemes can map to multiple words, e.g., “time” and “some”. To tackle this issue, this paper proposes a deep learning network model with an Attention-based Gated Recurrent Unit for efficient viseme-to-word conversion and compares it against three other approaches. The proposed approach features strong robustness, high efficiency, and short execution time. It has been verified with analysis and practical experiments on predicting sentences from the benchmark LRS2 and LRS3 datasets. The main contributions of the paper are as follows: (1) a model is developed that is effective in converting visemes to words, discriminates between homopheme words, and is robust to incorrectly classified visemes; (2) the proposed model uses few parameters and therefore requires little overhead and time to train and execute; and (3) an improved performance is achieved in predicting spoken sentences from the LRS2 dataset, with an attained word accuracy rate of 79.6%, an improvement of 15.0% compared with the state-of-the-art approaches.
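To make the homopheme problem concrete, the following minimal sketch groups words by their viseme sequence. It is not the paper's model or data: the phoneme-to-viseme table, the viseme class names, and the toy lexicon are hypothetical simplifications chosen only for illustration. Words that collapse onto the same viseme sequence are exactly the cases a viseme-to-word converter has to disambiguate from context.

```python
from collections import defaultdict

# Hypothetical, heavily simplified phoneme-to-viseme table (illustration only;
# real viseme inventories are larger and differ between studies).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "ae": "V_open",
}

# Hypothetical toy lexicon: word -> phoneme sequence.
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "fan": ["f", "ae", "n"],
    "van": ["v", "ae", "n"],
    "tan": ["t", "ae", "n"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence (as a hashable tuple)."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def group_homophemes(lexicon):
    """Group words that share an identical viseme sequence."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[to_visemes(phonemes)].append(word)
    # Sort each group so the result is deterministic.
    return {visemes: sorted(words) for visemes, words in groups.items()}


groups = group_homophemes(LEXICON)
for visemes, words in groups.items():
    label = "homophemes" if len(words) > 1 else "unique"
    print(f"{' '.join(visemes)} -> {words} ({label})")
```

In this toy lexicon, four of the seven words share a single viseme sequence, so a perfect viseme classifier alone could not tell them apart. Resolving such groups requires sentence-level context, which is the role the abstract assigns to the viseme-to-word conversion stage.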

Collaborative Viseme Subword and End-to-end Modeling for Word-level Lip Reading

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading

Multi-Grained Spatio-temporal Modeling for Lip-reading

Sub-word Level Lip Reading With Visual Attention

Importance-Aware Information Bottleneck Learning Paradigm for Lip Reading

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

Cross-Modal Language Modeling in Multi-Motion-Informed Context for Lip Reading

Leveraging Visemes for Better Visual Speech Representation and Lip Reading

Generalizing sentence-level lipreading to unseen speakers: a two-stream end-to-end approach

Learn an Effective Lip Reading Model without Pains

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Decoding visemes: improving machine lipreading

Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition

Masked Vision and Language Modeling for Multi-modal Representation Learning

Efficient DNN Model for Word Lip-Reading

Disentangling Homophemes in Lip Reading using Perplexity Analysis

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Automatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading