Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang

Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Using this corpus, filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety.
Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication.

SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates

Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible

Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Textless Speech-to-Speech Translation With Limited Parallel Data

MLS: A Large-Scale Multilingual Dataset for Speech Research

Common Voice: A Massively-Multilingual Speech Corpus

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation

LibriVoxDeEn: A Corpus for German-to-English Speech Translation and German Speech Recognition

MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research

The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

Scaling Speech-Text Pre-training with Synthetic Interleaved Data

Scaling Speech Technology to 1,000+ Languages

Speech Wikimedia: A 77 Language Multilingual Speech Dataset