Abstract: We present an empirical study of the scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically: (i) We propose a formula that describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we also show that the total number of parameters alone is not sufficient for this purpose. (ii) We observe different power-law exponents when scaling the decoder versus scaling the encoder, and provide recommendations for the optimal allocation of encoder/decoder capacity based on this observation. (iii) We report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (either machine-generated or human-translated text). We observe that natural text on the target side enjoys scaling, which manifests as a successful reduction of the cross-entropy loss. (iv) Finally, we investigate the relationship between the cross-entropy loss and the quality of the generated translations. We find two different behaviors, depending on the nature of the test data. For test sets that were originally translated from the target language to the source language, both loss and BLEU score improve as model size increases. In contrast, for test sets originally translated from the source language to the target language, the loss improves, but the BLEU score stops improving after a certain threshold. We release generated text from all models used in this study.
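To make point (i) concrete, the following is a minimal sketch of what a bivariate scaling law could look like. The abstract does not state the formula, so the functional form used here (a product of two power laws in encoder and decoder size plus an irreducible loss term) is an assumption, and every name and numeric value (`bivariate_loss`, `alpha`, `p_enc`, `p_dec`, `l_inf`, the parameter counts) is hypothetical and chosen only for illustration.

```python
def bivariate_loss(n_enc, n_dec, alpha, p_enc, p_dec, l_inf):
    """Illustrative loss as a function of encoder and decoder size.

    ASSUMED form (not taken from the abstract):
        L(n_enc, n_dec) = alpha * n_enc**(-p_enc) * n_dec**(-p_dec) + l_inf
    n_enc, n_dec: encoder / decoder parameter counts (here in millions).
    p_enc, p_dec: separate power-law exponents for encoder and decoder.
    l_inf: irreducible loss that remains as both sizes grow.
    """
    return alpha * n_enc ** (-p_enc) * n_dec ** (-p_dec) + l_inf


# Hypothetical coefficients, for illustration only.
params = dict(alpha=1.0, p_enc=0.2, p_dec=0.3, l_inf=1.0)

# Two models with the same total size (400M) but different allocations.
encoder_heavy = bivariate_loss(n_enc=300, n_dec=100, **params)
decoder_heavy = bivariate_loss(n_enc=100, n_dec=300, **params)

# With unequal exponents the two allocations give different losses,
# so the total parameter count alone does not determine the loss.
print(f"encoder-heavy: {encoder_heavy:.4f}")
print(f"decoder-heavy: {decoder_heavy:.4f}")
```

Under this assumed form, two models with identical total parameter counts receive different predicted losses whenever the two exponents differ, which is one way to read the claim that total size alone is insufficient. Which allocation comes out ahead depends entirely on the fitted exponents; the values above are not the paper's.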

Finding the Optimal Vocabulary Size for Neural Machine Translation

Vocabulary Learning Via Optimal Transport for Neural Machine Translation

Vocabulary Manipulation for Neural Machine Translation

On Using Very Large Target Vocabulary for Neural Machine Translation

A Systematic Analysis of Vocabulary and BPE Settings for Optimal Fine-tuning of NMT: A Case Study of In-domain Translation

Vocabulary Selection Strategies for Neural Machine Translation

On the Importance of Word Boundaries in Character-level Neural Machine Translation

A Study of Multilingual Neural Machine Translation

Neural Machine Translation Model with a Large Vocabulary Selected by Branching Entropy

Optimizing Segmentation Granularity for Neural Machine Translation

How Large a Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection

Scaling Laws for Neural Machine Translation

Massively Multilingual Neural Machine Translation

Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation

Revisiting Machine Translation for Cross-lingual Classification

Towards Integrated Classification Lexicon for Handling Unknown Words in Chinese-Vietnamese Neural Machine Translation

Finding Better Subword Segmentation for Neural Machine Translation

Massive Exploration of Neural Machine Translation Architectures

Modeling Vocabulary for Big Code Machine Learning

Machine Translation for Machines: the Sentiment Classification Use Case

An Investigation On Statistical Machine Translation With Neural Language Models