Abstract: Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised visual representation learning. We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled multilingual data, and then fine-tune the visual model on labelled transcriptions. Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance; (2) multi-lingual outperforms English-only pre-training; (3) using languages which are more similar yields better results; and (4) fine-tuning on unseen languages is competitive to using the target language in the pre-training set. We hope our study inspires future research on non-English-only speech representation learning.

What problem does this paper attempt to address?

This paper addresses how to learn visual speech representations with self-supervised learning in a cross-lingual setting. Specifically, it explores pre-training an audio-visual model on unlabelled multilingual data and then fine-tuning the visual model on labelled, language-specific data to improve visual speech recognition. Whereas existing cross-lingual self-supervised research has relied on audio signals to create representations, this paper focuses on visual signals. Its main findings are as follows:

1. **Effect of multilingual pre-training**: When more data is used for pre-training, multilingual models generally outperform monolingual ones. However, when the amount of pre-training data is kept fixed, monolingual models tend to perform better.
2. **Multilingual pre-training outperforms English-only pre-training**: Even when the amount of data is the same, multilingual pre-training is still superior to pre-training on English alone.
3. **Influence of language similarity**: Using more similar languages for pre-training and fine-tuning yields better results.
4. **Generalization to unseen languages**: Fine-tuning on a language that was not seen during pre-training remains competitive with including the target language in the pre-training set.

Through these findings, the authors hope to inspire future research on non-English-only speech representation learning.

Learning Cross-lingual Visual Speech Representations

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Jointly Learning Visual and Auditory Speech Representations from Raw Data

Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

Improved Self-Supervised Multilingual Speech Representation Learning Combined with Auxiliary Language Information

Unsupervised ASR via Cross-Lingual Pseudo-Labeling

Speech Guided Disentangled Visual Representation Learning for Lip Reading

Audiovisual Masked Autoencoders

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Multilingual Vision-Language Pre-training for the Remote Sensing Domain