Cross-Modal Language Modeling in Multi-Motion-Informed Context for Lip Reading

Xi Ai, Bin Fang
DOI: https://doi.org/10.1109/TASLP.2023.3282109
IEEE/ACM Transactions on Audio Speech and Language Processing
Abstract: We observe that in lip reading, language is transformed locally rather than globally, i.e., speaking and writing follow the same basic grammar rules. In this work, we present a cross-modal language model for lip reading on silent videos. Unlike previous work, we use multi-motion-informed contexts, composed of multiple lip-motion representations from different subspaces, to guide decoding via the source-target attention mechanism. We present a piece-wise pre-training strategy, inspired by multi-task learning, that pre-trains a visual module to generate multi-motion-informed contexts for cross-modality and pre-trains a decoder to generate text for language modeling. Our final large-scale model outperforms baseline models on four datasets: LRS2, LRS3, LRW, and GRID. We will release our source code on GitHub.
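The abstract does not say how the multiple lip-motion contexts are combined before the decoder attends to them. As a rough illustration only, the sketch below assumes the contexts are concatenated along the time axis and used as both keys and values in plain single-head scaled dot-product attention, with the decoder states as queries. The names `cross_attention`, `context_a`, and `context_b` are hypothetical and do not come from the paper; the authors' model may differ in every detail.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, contexts):
    """Single-head source-target attention (illustrative sketch).

    queries:  list of decoder state vectors, each of dimension d.
    contexts: list of context sequences; each sequence is a list of
              d-dimensional vectors (one hypothetical "motion subspace").
    Assumption: contexts are concatenated along time into one memory
    that serves as both keys and values.
    """
    memory = [vec for ctx in contexts for vec in ctx]
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot product between the query and every memory vector.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in memory
        ]
        w = softmax(scores)
        # Weighted sum of memory vectors (values equal keys here).
        out = [sum(wj * v[i] for wj, v in zip(w, memory)) for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data: two decoder states, two contexts of different lengths.
queries = [[1.0, 0.0], [0.0, 1.0]]
context_a = [[1.0, 0.0], [0.0, 1.0]]
context_b = [[0.5, 0.5]]
outputs, weights = cross_attention(queries, [context_a, context_b])
```

Each row of `weights` is a distribution over all three memory positions, so a single decoder step can draw on every context at once. A trained model would add learned query, key, and value projections and multiple heads, which this sketch omits.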
Computer Science