Abstract: Lip‐reading deciphers speech by observing lip movements without relying on audio data. Rapid advances in deep learning have significantly improved lip‐reading for both English and Chinese; however, research on dialects such as Cantonese remains scarce. Consequently, most Chinese lip‐reading datasets focus on Mandarin, with only a few addressing Cantonese. To bridge this gap, a sentence‐level Cantonese lip‐reading dataset, Cantonese Lip‐reading Sentences, is introduced, comprising over 500 unique speakers and more than 30,000 samples. To ensure alignment with real‐world scenarios, no restrictions are imposed on gender, age, posture, lighting conditions, or speech rate. The pipeline used to collect and construct the dataset is described in detail, and an innovative visual frontend, 3D‐visual attention net, is introduced. This frontend combines the advantages of convolution and self‐attention to extract fine‐grained lip region features, which are then fed into a conformer backend for temporal sequence modelling. The model achieves comparable performance on the Chinese Mandarin Lip Reading (CMLR), Lip Reading Sentences 2 (LRS2), Lip Reading Sentences 3 (LRS3), and Cantonese Lip‐reading Sentences datasets. Benchmark tests on Cantonese Lip‐reading Sentences demonstrate the challenges it poses, providing a new research foundation for dialect lip‐reading and fostering the advancement of Cantonese lip‐reading tasks.
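The idea of pairing convolution (local patterns) with self‐attention (global context) can be illustrated with a toy sketch. This is not the 3D‐visual attention net itself: the abstract gives no layer details, so everything below is an assumption made for illustration. That includes the function names, the 1D temporal kernel standing in for a 3D convolution, the identity query/key/value projections, and the additive fusion of the two branches. Each frame is assumed to be already reduced to a small feature vector.

```python
import math


def temporal_conv(frames, kernel):
    """1D convolution along the time axis with zero padding (same-length output).

    Assumption: stands in for a real 3D spatio-temporal convolution.
    """
    half = len(kernel) // 2
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            s = t + k - half  # index of the neighbouring frame
            if 0 <= s < len(frames):
                for d in range(dim):
                    acc[d] += w * frames[s][d]
        out.append(acc)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over frames.

    Assumption: identity Q/K/V projections (no learned weights).
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, frames))
                    for d in range(dim)])
    return out


def conv_attention_frontend(frames, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical fusion: add the local (conv) and global (attention) branches."""
    local_feats = temporal_conv(frames, kernel)
    global_feats = self_attention(frames)
    return [[l + g for l, g in zip(lv, gv)]
            for lv, gv in zip(local_feats, global_feats)]


if __name__ == "__main__":
    # Four identical toy frames, each a 2-dimensional feature vector.
    features = conv_attention_frontend([[1.0, 1.0]] * 4)
    for row in features:
        print(row)
```

In a full system the per‐frame outputs of such a frontend would be passed to a sequence model (a conformer backend in the paper) for temporal modelling; that stage is not sketched here.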