NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang

Abstract: The development of neural techniques has opened up new avenues for research in machine translation. Today, neural machine translation (NMT) systems can leverage highly multilingual capacities and even perform zero-shot translation, delivering promising results in terms of language coverage and quality. However, scaling quality NMT requires large volumes of parallel bilingual data, which are not equally available for the 7,000+ languages in the world [1]. Focusing on improving the translation quality of a relatively small group of high-resource languages comes at the expense of directing research attention to low-resource languages, exacerbating digital inequities in the long run. To break this pattern, here we introduce No Language Left Behind, a single massively multilingual model that leverages transfer learning across languages. We developed a conditional computational model based on the Sparsely Gated Mixture of Experts architecture [2,3,4,5,6,7], which we trained on data obtained with new mining techniques tailored for low-resource languages. Furthermore, we devised multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. We evaluated the performance of our model over 40,000 translation directions using tools created specifically for this purpose: an automatic benchmark (FLORES-200), a human evaluation metric (XSTS) and a toxicity detector that covers every language in our model. Compared with the previous state-of-the-art models, our model achieves an average of 44% improvement in translation quality as measured by BLEU. By demonstrating how to scale NMT to 200 languages and making all contributions in this effort freely available for non-commercial use, our work lays important groundwork for the development of a universal translation system.
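For readers unfamiliar with the Sparsely Gated Mixture of Experts idea named in the abstract, the following is a minimal sketch of top-k expert routing for a single token. It is an illustration only and not the NLLB implementation: the function names, the toy linear experts, the dimensions and the choice of k=2 are all assumptions made here, and details such as noisy gating, load balancing and expert capacity are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_logits, k=2):
    """Keep the k largest gate logits and renormalize them with a softmax.

    Returns {expert_index: weight}. Experts outside the top k get no weight
    and are never evaluated, which is what makes the layer sparse.
    """
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return dict(zip(top, weights))


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_layer(token, gate_matrix, experts, k=2):
    """Route one token vector through its top-k experts and mix the outputs."""
    gate_logits = matvec(gate_matrix, token)  # one logit per expert
    routing = top_k_gate(gate_logits, k)
    output = [0.0] * len(token)
    for expert_index, weight in routing.items():
        # Hypothetical toy expert: a single linear map instead of a feed-forward block.
        expert_out = matvec(experts[expert_index], token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, routing


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_experts = 4, 8  # illustrative sizes, not taken from the paper
gate_matrix = random_matrix(num_experts, dim, rng)
experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, routing = moe_layer(token, gate_matrix, experts, k=2)
print("experts used:", sorted(routing))
print("gate weights sum to:", round(sum(routing.values()), 6))
```

Only the selected experts are evaluated for each token, so the number of parameters can grow with the number of experts while the per-token compute stays roughly constant. This is the general sense in which such a model is conditional.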

Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task

Scaling Laws for Neural Machine Translation

Scaling Laws for Multilingual Neural Machine Translation

Scaling Laws for Multilingual Language Models

Scaling Law for Document Neural Machine Translation

Data Scaling Laws in NMT: The Effect of Noise and Architecture

Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

Revisiting Neural Scaling Laws in Language and Vision

Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Scaling Laws for Neural Language Models

Scaling End-to-End Models for Large-Scale Multilingual ASR

Scaling Laws for Downstream Task Performance of Large Language Models

Multi-Layer Softmaxing during Training Neural Machine Translation for Flexible Decoding with Fewer Layers

Decoding with Large-Scale Neural Language Models Improves Translation

Scaling neural machine translation to 200 languages

Language models scale reliably with over-training and on downstream tasks

Machine Translation with Large Language Models: Decoder Only vs. Encoder-Decoder

Scaling laws for language encoding models in fMRI

Investigating Decoder-only Large Language Models for Speech-to-text Translation

Parallelizing and Optimizing Neural Encoder–Decoder Models Without Padding on Multi-Core Architecture