ViSpeR: Multilingual Audio-Visual Speech Recognition

Sanath Narayan, Yasser Abdelaziz Dahou Djilali, Ankit Singh, Eustache Le Bihan, Hakim Hacid
2024-05-27
Abstract: This work presents an extensive and detailed study of Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We have collected large-scale datasets for each language except English and trained supervised learning models on them. Our model, ViSpeR, is trained in a multilingual setting and achieves competitive performance on newly established benchmarks for each language. The datasets and models are released to the community to serve as a foundation for further research on AVSR, an increasingly important area of research. Code is available at https://github.com/YasserdahouML/visper.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?