Abstract: Lip‐reading deciphers speech by observing lip movements, without relying on audio data. Rapid advances in deep learning have significantly improved lip‐reading for both English and Chinese; however, research on dialects such as Cantonese remains scarce. Consequently, most Chinese lip‐reading datasets focus on Mandarin, and only a few address Cantonese. To bridge this gap, a sentence‐level Cantonese lip‐reading dataset, Cantonese Lip‐Reading Sentences, is introduced, comprising over 500 unique speakers and more than 30,000 samples. To align with real‐world scenarios, no restrictions are imposed on gender, age, posture, lighting conditions, or speech rate. The pipeline used to collect and construct the dataset is described in full, and an innovative visual frontend, the 3D‐visual attention net, is introduced. This frontend combines the advantages of convolution and self‐attention to extract fine‐grained lip‐region features. These features are then fed into a Conformer backend for temporal sequence modelling, achieving comparable performance on the Chinese Mandarin Lip Reading (CMLR), Lip Reading Sentences 2 (LRS2), Lip Reading Sentences 3 (LRS3), and Cantonese Lip‐Reading Sentences datasets. Benchmark tests on Cantonese Lip‐Reading Sentences demonstrate the challenges the dataset poses, providing a new research foundation for dialect lip‐reading and fostering the advancement of Cantonese lip‐reading tasks.
