Rho-1: Not All Tokens Are What You Need

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, Weizhu Chen
2024-05-23
Abstract: Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis examines token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher scores. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% in 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively - matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves 6.8% average enhancement across 15 diverse tasks, increasing both the efficiency and performance of language model pre-training.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the following problem: during language model pre-training, not all tokens in the corpus are equally important for training. Traditional pre-training methods apply the next-token prediction loss uniformly to all training tokens, which can waste compute and limit model performance. By analyzing training dynamics at the token level, the authors found that different tokens exhibit distinct loss patterns. Based on this observation, they propose a new language model, Rho-1, which uses Selective Language Modeling (SLM) and trains only on useful tokens that are consistent with the target distribution, improving both training efficiency and model performance.

Specifically, Rho-1 proceeds in three steps:

1. **Reference model training**: Train a reference model on a carefully curated, high-quality dataset. This model is used to evaluate the loss of each token in the pre-training corpus.
2. **Token scoring**: Use the reference model to compute the loss of each token in the corpus and assign each token a score based on that loss.
3. **Selective training**: During pre-training, apply the loss only to tokens with higher scores, concentrating training on the most beneficial tokens.

With this method, Rho-1 shows significant improvements on multiple benchmarks, especially on math tasks. For example, after continual pre-training on the 15B-token OpenWebMath corpus and subsequent fine-tuning, Rho-1-7B reaches 51.8% on the MATH dataset, matching DeepSeekMath-7B with only 3% of its pre-training tokens. In general-domain pre-training on 80B tokens, Rho-1 improves average performance across 15 diverse tasks by 6.8%.

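
The token scoring and selective training steps described above can be illustrated with a small, framework-free sketch. It assumes that a token's score is its excess loss (the training model's loss minus the reference model's loss) and that the top fraction of tokens by score is kept; the summary itself only says tokens are scored "using a reference model". The function names `select_tokens` and `selective_loss`, the `keep_ratio` parameter, and the example loss values are hypothetical and chosen purely for illustration.

```python
import math


def select_tokens(train_losses, ref_losses, keep_ratio):
    """Return a 0/1 mask marking which tokens contribute to the loss.

    Assumption: score = training-model loss - reference-model loss
    (excess loss). Tokens with the highest scores are kept.
    """
    if not train_losses or len(train_losses) != len(ref_losses):
        raise ValueError("loss lists must be non-empty and of equal length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Per-token score: how much worse the current model is than the reference.
    scores = [t - r for t, r in zip(train_losses, ref_losses)]

    # Keep the top keep_ratio fraction of tokens (at least one token).
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


def selective_loss(train_losses, mask):
    """Average the training loss over the selected tokens only."""
    selected = [loss for loss, m in zip(train_losses, mask) if m]
    return sum(selected) / len(selected)


# Hypothetical per-token losses for a four-token sequence.
train_losses = [2.0, 0.5, 3.0, 1.0]  # from the model being trained
ref_losses = [0.5, 0.4, 2.9, 0.2]    # from the reference model

mask = select_tokens(train_losses, ref_losses, keep_ratio=0.5)
print(mask)                                # [1, 0, 0, 1]
print(selective_loss(train_losses, mask))  # 1.5
```

In this toy example the third token has the highest raw loss, but the reference model finds it almost equally hard, so its score is low and it is excluded. In a real training loop the mask would be applied to a per-token cross-entropy tensor before averaging, so that unselected tokens contribute no gradient.
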
In conclusion, the main contribution of this paper is the SLM method, which improves both the training efficiency of language models and their performance on downstream tasks by training selectively on useful tokens.