Lip Reading Based on 3D Face Modeling and Spatial Transformation Learning

Zhongwei Wu, Waner Chen, Jiewen Xu, Yuehai Wang
DOI: https://doi.org/10.1109/iccc54389.2021.9674610
2021-01-01
Abstract: Lip reading is the task of recovering text from video without audio, which is of great significance for speech recognition when the audio is damaged. It relies mainly on extracting the appearance features and movement information of the speaker's lips. Deep-learning-based lip reading has made considerable progress, but lip deflection caused by the camera angle still reduces its accuracy. This paper investigates how to reduce the effect of camera angle on lip reading. We use 3D face alignment to obtain spatial depth information and modulate that depth information with color attributes to compensate for information lost to pose changes. At the same time, we use deformable convolution to learn the spatial position transformation. The methods are verified on the TCD-TIMIT dataset, which has two camera angles: straight and 30°. Lip-reading accuracy on the 30° camera angle data improves significantly, approaching the accuracy on the straight-angle data, and accuracy on the straight-angle data also improves. These results demonstrate that the proposed method effectively weakens the effect of camera angle on lip reading.