Abstract: Lip‐reading deciphers speech by observing lip movements without relying on audio data. Rapid advances in deep learning have significantly improved lip‐reading for both English and Chinese; however, research on dialects such as Cantonese remains scarce: most Chinese lip‐reading datasets focus on Mandarin, and only a few address Cantonese. To bridge this gap, a sentence‐level Cantonese lip‐reading dataset, Cantonese Lip‐Reading Sentences, is introduced, comprising over 500 unique speakers and more than 30,000 samples. To keep the data aligned with real‐world scenarios, no restrictions are imposed on gender, age, posture, lighting conditions, or speech rate. The pipeline used to collect and construct the dataset is described in detail, and an innovative visual frontend, the 3D‐Visual Attention Net, is introduced. This frontend combines the advantages of convolution and self‐attention to extract fine‐grained features from the lip region. These features are then fed into a Conformer backend for temporal sequence modelling, achieving comparable performance on the Chinese Mandarin Lip Reading (CMLR), Lip Reading Sentences 2 (LRS2), Lip Reading Sentences 3 (LRS3), and Cantonese Lip‐Reading Sentences datasets. Benchmark tests on Cantonese Lip‐Reading Sentences demonstrate the challenges the dataset poses, providing a new research foundation for dialect lip‐reading and fostering progress on Cantonese lip‐reading tasks.
