Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer

Wang, Huijuan
DOI: https://doi.org/10.1007/s00371-024-03515-y
IF: 2.835
2024-06-12
The Visual Computer
Abstract: Lip-reading has attracted growing attention in recent years and has broad application prospects in areas such as human–computer interaction, surveillance and security, and audiovisual speech recognition. However, progress has been slow because of the difficulty of handling both the fine spatial features of small images in continuous video frames and the temporal features between frames. To address the challenges of spatial feature extraction, temporal feature extraction and model lightweighting, we propose Mini-3DCvT, a high-precision, highly robust and lightweight lip-reading method. It combines a visual transformer with 3D convolution to extract spatiotemporal features from continuous images, exploiting the properties of convolution and transformers to capture both local and global features. It applies weight transformation and weight distillation in the convolution and transformer structures for model compression, and then sends the extracted features to a bidirectional gated recurrent unit for sequence modeling. Experimental results on the large-scale public lip-reading datasets LRW and LRW-1000 show that the method achieves 88.3% and 57.1% recognition accuracy, respectively, while effectively reducing the model's computation and number of parameters, improving the overall performance of the lip-reading model.
computer science, software engineering
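
The pipeline described in the abstract (3D convolution, then a visual transformer, then a bidirectional GRU) can be illustrated by tracing tensor shapes through it. The sketch below is only an illustration under stated assumptions: the clip size (29 frames of 88x88 pixels), the kernel, stride and padding values, the patch size, the channel width, the GRU hidden size, and the choice to pool one feature vector per frame are all hypothetical and are not taken from the paper. Only the standard convolution output-size formula and the fact that a bidirectional GRU concatenates its two directions are general facts.

```python
from typing import Dict, Tuple


def conv3d_output_shape(
    in_shape: Tuple[int, int, int],
    kernel: Tuple[int, int, int],
    stride: Tuple[int, int, int],
    padding: Tuple[int, int, int],
) -> Tuple[int, int, int]:
    """Map (T, H, W) to (T', H', W') for an undilated 3D convolution.

    Each axis follows the usual rule: floor((n + 2p - k) / s) + 1.
    """
    return tuple(
        (n + 2 * p - k) // s + 1
        for n, k, s, p in zip(in_shape, kernel, stride, padding)
    )


def trace_pipeline(
    frames: int = 29,         # assumed clip length, not from the paper
    height: int = 88,         # assumed mouth-crop height
    width: int = 88,          # assumed mouth-crop width
    conv_channels: int = 64,  # assumed front-end channel width
    patch: int = 4,           # assumed transformer patch size
    gru_hidden: int = 256,    # assumed GRU hidden size per direction
) -> Dict[str, tuple]:
    """Return the shape of the data after each stage of the sketch."""
    # Stage 1: 3D convolution mixes neighbouring frames (temporal kernel 5)
    # while downsampling each frame spatially by a factor of 2.
    t, h, w = conv3d_output_shape(
        (frames, height, width),
        kernel=(5, 7, 7),
        stride=(1, 2, 2),
        padding=(2, 3, 3),
    )

    # Stage 2: each frame's feature map is cut into non-overlapping patches,
    # which become the tokens the transformer attends over (global context).
    tokens_per_frame = (h // patch) * (w // patch)

    # Stage 3: assume the transformer output is pooled to one vector per
    # frame, and the bidirectional GRU concatenates forward and backward
    # hidden states, giving 2 * gru_hidden features per time step.
    return {
        "conv3d_output": (conv_channels, t, h, w),
        "transformer_tokens": (t, tokens_per_frame),
        "bigru_output": (t, 2 * gru_hidden),
    }


if __name__ == "__main__":
    for stage, shape in trace_pipeline().items():
        print(f"{stage}: {shape}")
```

With these assumed values the temporal length is preserved through the convolution, so the recurrent stage still sees one feature vector per input frame.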