Multi-Modal Knowledge Transfer for Target Speaker Lipreading with Improved Audio-Visual Pretraining and Cross-Lingual Fine-Tuning

Genshun Wan, Zhongfu Ye
DOI: https://doi.org/10.1109/icmew63481.2024.10645443
2024-01-01
Abstract: Lipreading aims to predict speech content from lip movements without relying on audio. This paper addresses Task 2 of the ICME 2024 Grand Challenge on chat-scenario Chinese lipreading, which concerns target-speaker lipreading. To this end, we propose an improved cross-lingual multi-modal knowledge transfer method. Specifically, the audio-visual self-supervised learning method is improved through enhanced representation knowledge transfer to overcome the mismatch of application scenarios and improve the quality of clustering. A multi-modal fine-tuning mechanism based on cross-lingual knowledge transfer is introduced to exploit cross-lingual pretrained models. To expand the coverage of spoken content and speaking scenes, the target-speaker lipreading model is constructed by combining data from multiple target speakers. Our system achieved a character error rate of 77.61%, a relative improvement of 22.15% over the official baseline system, and ranked first in the ChatCLR Challenge.