Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation

Huang, Shibo
DOI: https://doi.org/10.1007/s11704-023-2230-x
IF: 2.6688
2023-04-01
Frontiers of Computer Science
Abstract: In this paper, we propose a seq2seq model based on self-attention and self-distillation for sentence-level lip reading. The model consists of a CNN front-end, pixel-wise learning, temporal learning, and a decoder. We apply the CNN front-end to capture shallow spatial features in the image sequence, and employ the Resformer module to learn the deep spatial correlations between pixels within each frame (pixel-wise learning). An encoder then learns the temporal features across frames (temporal learning). Finally, the decoder decodes the visual information into predicted text. In addition, the model applies self-distillation to further improve performance. In experiments on GRID, LRW and LRW-1000, the proposed model achieves competitive results on the word error rate (WER), character error rate (CER) and accuracy (Acc) metrics. However, our work is limited by its model complexity, which needs to be addressed in subsequent work.
computer science, information systems, theory & methods, software engineering
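The abstract reports results in terms of WER, CER and Acc. As a rough illustration of how these metrics are commonly defined, the sketch below computes them in plain Python from an edit distance. It is not the authors' evaluation code: the function names and the sample sentences are assumptions made for this example, and the paper's exact text normalization and scoring protocol are not stated in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    # prev[j] is the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def accuracy(labels, predictions):
    """Fraction of word-level predictions that exactly match their labels."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


if __name__ == "__main__":
    # Hypothetical reference/prediction pair, made up for illustration only.
    reference = "bin blue at f two now"
    prediction = "bin blue at f two soon"
    print("WER:", wer(reference, prediction))
    print("CER:", cer(reference, prediction))
    print("Acc:", accuracy(["about", "again"], ["about", "against"]))
```

WER and CER are typically reported for sentence-level benchmarks such as GRID, while Acc is the usual metric for word-level classification benchmarks such as LRW and LRW-1000. Lower is better for the two error rates and higher is better for accuracy.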
What problem does this paper attempt to address?