Learning Cross-lingual Visual Speech Representations

Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic
2023-03-15
Abstract: Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised visual representation learning. We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled multilingual data, and then fine-tune the visual model on labelled transcriptions. Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance; (2) multi-lingual outperforms English-only pre-training; (3) using languages which are more similar yields better results; and (4) fine-tuning on unseen languages is competitive to using the target language in the pre-training set. We hope our study inspires future research on non-English-only speech representation learning.
Computation and Language, Computer Vision and Pattern Recognition, Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to learn visual speech representations with self-supervised learning in a cross-lingual setting. Specifically, it explores pre-training a model on unlabelled multilingual data and then fine-tuning it on labelled language-specific data to improve visual speech recognition. Whereas existing cross-lingual research has focused on creating representations from audio signals, this paper focuses on self-supervised learning from visual signals. The main findings of the paper are as follows:

1. **Effect of multilingual pre-training**: When more data is used for pre-training, multilingual models generally outperform monolingual models. However, when the amount of pre-training data is kept fixed, monolingual models tend to perform better.
2. **Multilingual pre-training is superior to English-only pre-training**: Even when the amount of data is the same, multilingual pre-training still outperforms pre-training on English alone.
3. **Influence of language similarity**: Pre-training on languages that are more similar to the fine-tuning language yields better results.
4. **Generalization to unseen languages**: Fine-tuning on a language that was not seen during pre-training is competitive with including the target language in the pre-training set.

Through this study, the authors hope to inspire more future research on non-English-only speech representation learning.
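The fixed-data comparison described above can be illustrated with a small sketch. It is not the authors' code: the function name `allocate_hours`, the language codes, the 300-hour budget, and the equal split across languages are all assumptions made for illustration, since the summary does not state how the data was divided.

```python
def allocate_hours(languages, total_hours):
    """Split a fixed pre-training budget (in hours) equally across languages.

    Equal splitting is an assumption for this sketch; the paper may use
    a different ratio between languages.
    """
    if not languages:
        raise ValueError("at least one language is required")
    share = total_hours / len(languages)
    return {lang: share for lang in languages}


# Hypothetical budget and language codes, chosen only for illustration.
budget = 300.0
monolingual = allocate_hours(["es"], budget)
multilingual = allocate_hours(["es", "it", "fr"], budget)

# Both configurations use the same total amount of unlabelled data,
# so any performance difference comes from the language mix, not data size.
print(monolingual)
print(multilingual)
```

Holding the total fixed is what separates the effect of language diversity from the effect of simply having more data, which is the distinction the first finding draws.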