Abstract: Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis examines token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher scores. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% on 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the efficiency and the performance of language model pre-training.

What problem does this paper attempt to address?

The problem this paper addresses is that, during language model pre-training, not all tokens in the corpus are equally important to training. Traditional pre-training methods apply the next-token prediction loss uniformly to all training tokens, which can waste compute and limit gains in model performance. By analyzing training dynamics at the token level, the authors found that loss patterns differ significantly across tokens. Based on this observation, they propose a new language model, Rho-1, which uses Selective Language Modeling (SLM) and trains only on useful tokens that are consistent with the target distribution, improving both training efficiency and model performance.

Specifically, Rho-1 proceeds in three steps:

1. **Reference model training**: First, train a reference model on a carefully curated, high-quality dataset. This model is used to evaluate the loss of each token in the pre-training corpus.
2. **Token scoring**: Use the reference model to calculate the loss of each token, and score the tokens according to these losses.
3. **Selective training**: During pre-training, apply the loss function only to the tokens with higher scores, concentrating training on the most beneficial tokens.

With this method, Rho-1 shows significant improvements on multiple benchmarks, especially math tasks. For example, after continual pre-training on 15 billion math-related tokens and subsequent fine-tuning, Rho-1-7B reaches 51.8% on the MATH dataset, matching DeepSeekMath-7B while using only 3% of its pre-training tokens. Rho-1 also performs well in general-domain pre-training: when pre-trained on 80 billion general tokens, it improves average performance across 15 tasks by 6.8%.
In conclusion, the paper's main contribution is the SLM method, which improves both the training efficiency of language models and their downstream task performance by training selectively on useful tokens.
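The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `token_losses` and `selective_lm_loss` and the `keep_ratio` parameter are hypothetical, and scoring each token by the gap between the training model's loss and the reference model's loss is an assumption about how the reference losses are turned into scores.

```python
import numpy as np


def token_losses(logits, targets):
    """Per-token cross-entropy for logits of shape (T, V) and targets of shape (T,)."""
    # Subtract the row maximum for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]


def selective_lm_loss(train_logits, ref_logits, targets, keep_ratio=0.6):
    """Average the training loss over the highest-scoring tokens only.

    Assumption: score = training-model loss minus reference-model loss.
    keep_ratio is a hypothetical knob for the fraction of tokens kept.
    """
    train_loss = token_losses(train_logits, targets)
    ref_loss = token_losses(ref_logits, targets)
    scores = train_loss - ref_loss

    # Keep the top-k tokens by score; all other tokens contribute no loss.
    k = max(1, int(round(keep_ratio * len(targets))))
    keep = np.argsort(-scores)[:k]
    mask = np.zeros(len(targets), dtype=bool)
    mask[keep] = True
    return train_loss[mask].mean(), mask


# Usage with random stand-ins for the two models' outputs.
rng = np.random.default_rng(0)
T, V = 10, 5  # sequence length and vocabulary size (toy values)
train_logits = rng.normal(size=(T, V))
ref_logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
loss, mask = selective_lm_loss(train_logits, ref_logits, targets, keep_ratio=0.6)
print("selected tokens:", np.flatnonzero(mask), "loss:", float(loss))
```

Setting `keep_ratio=1.0` recovers the ordinary mean next-token loss over all tokens, which makes the uniform baseline a special case of the same function. In a real training loop the mask would be applied to the loss of an autodiff framework so that unselected tokens produce no gradient.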

Rho-1: Not All Tokens Are What You Need
