MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models

Zhenpeng Su, Xing Wu, Xue Bai, Zijia Lin, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu
2024-03-28
Abstract: Generative language models are usually pretrained on large text corpora via predicting the next token (i.e., sub-word/word/phrase) given the previous ones. Recent works have demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge in text corpora during training, i.e., the imbalance between frequent tokens and infrequent ones. It can lead a language model to be dominated by common and easy-to-learn tokens, thereby overlooking the infrequent and difficult-to-learn ones. To alleviate that, we propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens. During training, it can dynamically assess the learning difficulty of a to-be-learned token, according to the information entropy of the corresponding predicted probability distribution over the vocabulary. Then it scales the training loss adaptively, trying to lead the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at different scales of 468M, 1.2B, and 6.7B parameters. Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses an inherent challenge in training generative language models: the imbalance between frequent and infrequent words (tokens) in text corpora. Two related issues follow from it:

1. **Word Frequency Imbalance**: In natural language datasets, occurrences of frequent words far outnumber occurrences of infrequent ones. This imbalance can cause the model to focus too heavily on common, easy-to-learn words while neglecting rare, difficult-to-learn ones.
2. **Learning Difficulty Bias**: Because infrequent words supply fewer training examples, they are harder to learn. Overlooking them can degrade the model's performance on downstream tasks.

To mitigate this issue, the authors propose a new loss function, MiLe Loss, designed to alleviate the bias of learning difficulties across words. During training, MiLe Loss dynamically assesses the learning difficulty of each to-be-learned word from the information entropy of the predicted probability distribution over the vocabulary, then scales the training loss adaptively so that the model pays more attention to the difficult-to-learn words.
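
The entropy-based scaling described above can be illustrated with a small sketch. The summary does not state the exact formula, so the weight used here, `(1 + H) ** gamma` with `H` the entropy of the predicted distribution, is an assumption chosen for illustration, and the names `entropy_scaled_loss` and `gamma` are hypothetical rather than taken from the paper. The sketch covers only the forward computation for a single token, in plain Python.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_scaled_loss(logits, target, gamma=1.0):
    """Cross-entropy for one token, scaled by an entropy-based weight.

    ASSUMPTION: the weight (1 + H) ** gamma is an illustrative choice;
    the summary only says the loss is scaled according to the entropy
    of the predicted distribution.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    weight = (1.0 + entropy(probs)) ** gamma
    return weight * cross_entropy


# A sharply peaked prediction (low entropy) versus a nearly flat one (high entropy).
confident_logits = [8.0, 0.0, 0.0, 0.0]
uncertain_logits = [0.5, 0.4, 0.3, 0.2]

confident_weight = 1.0 + entropy(softmax(confident_logits))
uncertain_weight = 1.0 + entropy(softmax(uncertain_logits))
print(confident_weight, uncertain_weight)
```

Under this assumed form, `gamma=0.0` recovers the ordinary cross-entropy, while larger values of `gamma` increase the relative contribution of tokens whose predicted distribution is spread out, which is the behavior the summary attributes to difficult-to-learn tokens.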