MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models

Zhenpeng Su, Xing Wu, Xue Bai, Zijia Lin, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu

2024-03-28

Abstract: Generative language models are usually pretrained on large text corpora by predicting the next token (i.e., sub-word/word/phrase) given the previous ones. Recent works have demonstrated the impressive performance of large generative language models on downstream tasks. However, existing generative language models generally neglect an inherent challenge in text corpora during training, i.e., the imbalance between frequent tokens and infrequent ones. This imbalance can lead a language model to be dominated by common and easy-to-learn tokens, thereby overlooking the infrequent and difficult-to-learn ones. To alleviate that, we propose a MiLe Loss function for mitigating the bias of learning difficulties with tokens. During training, it dynamically assesses the learning difficulty of a to-be-learned token according to the information entropy of the corresponding predicted probability distribution over the vocabulary. It then scales the training loss adaptively, trying to lead the model to focus more on the difficult-to-learn tokens. On the Pile dataset, we train generative language models at different scales of 468M, 1.2B, and 6.7B parameters. Experiments reveal that models incorporating the proposed MiLe Loss gain consistent performance improvements on downstream benchmarks.

Computation and Language

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses an inherent challenge in training generative language models: the imbalance between frequent and infrequent tokens in text corpora. Two aspects stand out:

1. **Token Frequency Imbalance**: In natural language datasets, occurrences of frequent tokens far outnumber those of infrequent ones. This imbalance can cause the model to focus too heavily on common, easy-to-learn tokens while neglecting rare, difficult-to-learn ones.
2. **Learning Difficulty Bias**: Because infrequent tokens appear less often in the training data, they are harder to learn. This bias can hurt the model's performance on downstream tasks.

To mitigate this issue, the authors propose a new loss function, MiLe Loss. During training, it dynamically estimates the learning difficulty of each to-be-learned token from the information entropy of the predicted probability distribution over the vocabulary, then scales the training loss adaptively so that the model pays more attention to difficult-to-learn tokens.
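The entropy-based scaling described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact weighting formula, so the sketch assumes the per-token weight is the entropy of the predicted distribution raised to a hypothetical exponent `gamma`, multiplied by the usual cross-entropy. All function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_scaled_loss(logits, target, gamma=1.0):
    """Cross-entropy for one token, scaled by entropy ** gamma.

    ASSUMPTION: the form `entropy ** gamma` and the parameter `gamma`
    are hypothetical; the text only says the loss is scaled according
    to the entropy of the predicted distribution.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[target])
    weight = entropy(probs) ** gamma  # high entropy -> treated as harder
    return weight * cross_entropy

def batch_loss(batch_logits, targets, gamma=1.0):
    """Mean entropy-scaled loss over a batch of token predictions."""
    losses = [entropy_scaled_loss(l, t, gamma)
              for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# A confidently predicted token versus an uncertain one (toy 4-token vocabulary).
confident = [8.0, 0.0, 0.0, 0.0]
uncertain = [0.5, 0.4, 0.3, 0.2]
print(entropy_scaled_loss(confident, 0))
print(entropy_scaled_loss(uncertain, 0))
```

With `gamma=0` the weight is 1 and the sketch reduces to ordinary cross-entropy. Larger values shift more of the total loss toward tokens whose predicted distribution is spread out, which is the behaviour the summary attributes to MiLe Loss.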

Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Understanding and Mitigating Tokenization Bias in Language Models

Unlocking Continual Learning Abilities in Language Models

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Bias Amplification: Language Models as Increasingly Biased Media

MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting

Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench

Regurgitative Training: The Value of Real Data in Training Large Language Models

Deep Generative Mixture Model for Robust Imbalance Classification

Grimoire is All You Need for Enhancing Large Language Models

An Independence-promoting Loss for Music Generation with Language Models

Likelihood-based Mitigation of Evaluation Bias in Large Language Models

Nanolm: an Affordable LLM Pre-training Benchmark Via Accurate Loss Prediction Across Scales

Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment

Benchmarking Benchmark Leakage in Large Language Models

Beyond Accuracy Optimization: Computer Vision Losses for Large Language Model Fine-Tuning

LBPE: Long-token-first Tokenization to Improve Large Language Models

Large Margin Neural Language Model

Evaluating and Mitigating Linguistic Discrimination in Large Language Models

Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model