Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, Jingren Zhou
2023-09-13
Abstract: Mathematical reasoning is a challenging task for large language models (LLMs), while the scaling relationship of it with respect to LLM capacity is under-explored. In this paper, we investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that pre-training loss is a better indicator of the model's performance than the model's parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance, and we find better models improve less with enlarged supervised datasets. To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.
Computation and Language
What problem does this paper attempt to address?
The paper studies how Large Language Models (LLMs) perform on mathematical reasoning tasks, and how pre-training loss, the amount of supervised data, and the amount of augmented data affect that performance.

### Research Questions

The paper addresses three core questions:

1. **Relationship between Pre-training Loss and Model Performance**: How does pre-training loss affect model performance under Supervised Fine-Tuning (SFT) and In-Context Learning (ICL)?
2. **Impact of Supervised Data Volume**: How does the amount of supervised data change the model's mathematical reasoning performance?
3. **Role of Augmented Data**: How can Rejection sampling Fine-Tuning (RFT) generate additional correct reasoning paths, and how much do those paths improve the model's mathematical reasoning ability?

### Main Findings

- **Pre-training Loss and Performance**: As pre-training loss decreases (that is, as the pre-trained model gets better), SFT and ICL performance improve roughly linearly, with the rate of improvement tapering as model quality rises. Pre-training loss is a better indicator of performance than parameter count.
- **Supervised Data Volume and Performance**: Model performance follows a log-linear relation with the amount of supervised data. More data improves performance, but better pre-trained models gain less from it.
- **Effectiveness of Augmented Data**: Data obtained through RFT significantly improves mathematical reasoning. The key factor is the number of distinct reasoning paths in the augmented set. Combining augmented samples from multiple models improves performance further.

### Methodology

- **Supervised Fine-Tuning (SFT)**: The model is trained in a supervised manner on a dataset of labeled mathematical problems.
- **Rejection sampling Fine-Tuning (RFT)**: Supervised models generate candidate reasoning paths; the correct ones are collected as an augmented fine-tuning dataset.
- **Pre-training Loss Analysis**: The pre-training losses of different pre-trained models are compared with their performance on mathematical reasoning tasks.

### Experimental Results

- Models with lower pre-training loss perform better on mathematical reasoning tasks. SFT performance rises roughly linearly as pre-training loss falls, with the rate of increase gradually slowing.
- Increasing the amount of supervised data improves model performance, especially for smaller models. For larger models, the gains are more limited.
- RFT significantly improves model performance, particularly for weaker models. Combining augmented samples from multiple models does better than using samples from a single model: it pushes LLaMA-7B to 49.3% accuracy on GSM8K, compared with 35.9% for SFT alone.

In summary, the paper examines the factors that affect the mathematical reasoning capabilities of large language models and proposes a simple yet effective method, Rejection sampling Fine-Tuning, to improve model performance.
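
The RFT data-collection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_fns` stands in for one or more supervised models (each a hypothetical callable that returns `k` generated reasoning paths for a question), answers are assumed to follow the GSM8K convention of ending with `#### <number>`, and two paths are treated as the same reasoning path when their `<<...>>` calculation annotations match. The function names and the deduplication key are illustrative choices.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Assumed GSM8K-style formatting: "<<48/2=24>>" marks a calculation,
# and "#### 72" marks the final answer.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")
EQUATION_RE = re.compile(r"<<([^<>]*)>>")

# Hypothetical sampler type: (question, k) -> k generated reasoning paths.
SampleFn = Callable[[str, int], List[str]]


def extract_answer(path: str) -> Optional[str]:
    """Return the final numeric answer of a reasoning path, or None."""
    match = ANSWER_RE.search(path)
    return match.group(1).replace(",", "") if match else None


def equation_key(path: str) -> Tuple[str, ...]:
    """Summarize a path by its ordered list of calculations."""
    return tuple(eq.replace(" ", "") for eq in EQUATION_RE.findall(path))


def rejection_sample(
    dataset: List[Dict[str, str]],
    sample_fns: List[SampleFn],
    k: int,
) -> List[Dict[str, str]]:
    """Collect correct, distinct reasoning paths from one or more models."""
    augmented = []
    for item in dataset:
        seen = set()  # distinct paths kept for this question, across models
        for sample_fn in sample_fns:
            for path in sample_fn(item["question"], k):
                if extract_answer(path) != item["answer"]:
                    continue  # reject paths with a wrong or missing answer
                key = equation_key(path)
                if key in seen:
                    continue  # reject paths that repeat known calculations
                seen.add(key)
                augmented.append(
                    {"question": item["question"], "response": path}
                )
    return augmented
```

The returned list would then be used as the fine-tuning dataset, possibly alongside the original labeled examples. Passing several samplers in `sample_fns` mirrors the idea of combining rejection samples from multiple models, since paths are deduplicated across all of them.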