Understanding Pictograph with Facial Features: End-to-end Sentence-Level Lip Reading of Chinese

Xiaobing Zhang, Haigang Gong, Xili Dai, Fan Yang, Nianbo Liu, Ming Liu
DOI: https://doi.org/10.1609/aaai.v33i01.33019211
2019-01-01
Proceedings of the AAAI Conference on Artificial Intelligence
Abstract: With the breakthrough of deep learning, lip reading technologies are progressing extraordinarily rapidly. It is well known that Chinese is the most widely spoken language in the world. Unlike alphabetic languages, it involves more than 1,000 pronunciations as Pinyin and nearly 90,000 pictographic characters as Hanzi, which makes lip reading of Chinese very challenging. In this paper, we implement visual-only Chinese lip reading of unconstrained sentences in a two-step end-to-end architecture (LipCH-Net), in which two deep neural network models perform Picture-to-Pinyin recognition (mouth motion pictures to pronunciations) and Pinyin-to-Hanzi recognition (pronunciations to texts) respectively, and are then jointly optimized to improve the overall performance. In addition, two modules in the Pinyin-to-Hanzi model are pre-trained separately on large auxiliary data before sequence-to-sequence training, to make the best of long sequence matches and avoid ambiguity. We collect six months of daily news broadcasts from the China Central Television (CCTV) website and semi-automatically label them into a 20.95 GB dataset with 20,495 natural Chinese sentences. When trained on the CCTV dataset, the LipCH-Net model outperforms all state-of-the-art lip reading frameworks. According to the results, our scheme not only accelerates training and reduces over-fitting, but also overcomes the syntactic ambiguity of Chinese, providing a baseline for future related work.