LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers
Feng Xue, Yu Li, Deyin Liu, Yincen Xie, Lin Wu, Richang Hong
DOI: https://doi.org/10.1109/tcsvt.2023.3282224
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Lipreading refers to understanding the speech of a speaker in a video and translating it into text. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both training and inference. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, owing to the limited number of speakers in the training set and to the dominant visual variations caused by the shape and color of different speakers' lips. Relying solely on the visible changes of the lips therefore tends to overfit the model. To improve generalization, we propose to use multi-modal features, i.e., visual and landmark features, to describe lip motion irrespective of speaker characteristics. The proposed sentence-level framework, dubbed LipFormer, is based on a visual-landmark transformer architecture in which a lip motion stream, a facial landmark stream, and a cross-modal fusion module are interconnected. More specifically, the two-stream embeddings produced by self-attention are fed into a cross-attention module to align the visual and landmark variations. The resulting fused features are decoded into text by a cascaded sequence-to-sequence translation. Extensive experiments demonstrate that our method generalizes well to unseen speakers on multiple datasets.
engineering, electrical & electronic
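
The cross-modal fusion step described in the abstract can be illustrated with a minimal sketch of scaled dot-product cross-attention. This is not the LipFormer implementation: it assumes the visual stream supplies the queries and the landmark stream supplies the keys and values (the abstract does not state the direction), it omits the learned projections and multiple heads a real transformer would use, and the function names and toy embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query frame attends over all key frames.

    queries: list of per-frame embeddings from one stream
    keys, values: lists of per-frame embeddings from the other stream
    Returns one fused embedding per query frame.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity between this query frame and every key frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value frames.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy embeddings: 3 frames, 2 dimensions per stream.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
landmark = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

# Assumption: visual frames query the landmark frames.
fused = cross_attention(visual, landmark, landmark)
```

Each fused frame is a convex combination of the landmark embeddings, weighted by how strongly the corresponding visual frame matches each landmark frame. In a full model the fused sequence would then be passed to the sequence-to-sequence decoder.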