Abstract: In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it is possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
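
Step (3) of the abstract, using the RM to guide sampling at inference time, can be illustrated with a best-of-N selection loop. The following is a minimal sketch under stated assumptions, not the report's implementation: `best_of_n`, `make_toy_sampler`, and `toy_reward` are hypothetical names, and the sampler and scorer are toy stand-ins for a policy model and a reward model.

```python
import itertools
from typing import Callable, List


def best_of_n(question: str,
              sample_fn: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n candidate solutions and return the one with the highest reward."""
    candidates: List[str] = [sample_fn(question) for _ in range(n)]
    return max(candidates, key=lambda c: reward_fn(question, c))


def make_toy_sampler(responses: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a policy model: cycles through fixed responses."""
    cycle = itertools.cycle(responses)
    return lambda question: next(cycle)


def toy_reward(question: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: favors responses ending in '4'."""
    return 1.0 if response.strip().endswith("4") else 0.0


sampler = make_toy_sampler(["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"])
best = best_of_n("What is 2 + 2?", sampler, toy_reward, n=3)
print(best)
```

In a real setting, `sample_fn` would wrap a call to the policy model with a nonzero sampling temperature and `reward_fn` would wrap a learned reward model; the selection logic itself stays the same.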

What problem does this paper attempt to address?

This paper aims to improve the mathematical reasoning and problem-solving capabilities of large language models, especially on complex mathematical problems. Specifically, the research team develops a series of math-specific large language models (the Qwen2.5-Math series) and applies self-improvement techniques to enhance their performance.

### Main problems and solutions:

1. **Improving mathematical reasoning ability**:
   - **Background**: Existing large language models perform poorly in mathematical reasoning, mainly because their pre-training data lacks mathematical content.
   - **Solution**: Enrich the pre-training data by constructing high-quality mathematical datasets (such as Qwen Math Corpus v1 and v2) to improve the model's mathematical reasoning ability.
2. **Automatically generating high-quality mathematical data**:
   - **Background**: Manually annotating mathematical problems and solutions is time-consuming and costly.
   - **Solution**: Use the Qwen2-Math-Instruct model to automatically generate large-scale, high-quality mathematical problems and their solutions, ensuring the quantity and quality of data while reducing the manual annotation workload.
3. **Introduction of the reward model**:
   - **Background**: Supervision based solely on the final answer does not provide sufficient feedback, especially for complex reasoning processes.
   - **Solution**: Develop a reward model (Qwen2.5-Math-RM) to evaluate the quality of the reasoning path, thereby guiding training in the supervised fine-tuning (SFT) and reinforcement learning (RL) stages and enabling the model to better handle intermediate steps and reasoning logic.
4. **Multi-language support and tool-integrated reasoning**:
   - **Background**: Many existing models support only English and cannot combine with external tools (such as Python interpreters) for accurate calculations.
   - **Solution**: The Qwen2.5-Math series models support Chinese and English and introduce the tool-integrated reasoning (TIR) mode, allowing the model to call external tools for complex calculations during the reasoning process, further improving problem-solving accuracy.

### Evaluation results:

- The Qwen2.5-Math series models perform strongly on multiple mathematical benchmarks (such as GSM8K, MATH, and GaoKao), significantly outperforming leading open-source and closed-source models.
- In particular, on high-difficulty competition questions such as AMC 2023, Qwen2.5-Math-72B-Instruct solves nearly all problems with the help of the reward model.
- Even the smallest 1.5B-parameter model can achieve a score close to 80 when using a Python interpreter, surpassing many current models.

In conclusion, this paper aims to significantly improve the capabilities of large language models in mathematical reasoning and problem-solving through a series of innovative techniques and methods, enabling them to handle a wide range of complex mathematical problems more effectively.
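
The tool-integrated reasoning (TIR) mode described above, where the model calls a Python interpreter during reasoning, can be sketched as a single extract-execute-append step. This is an illustrative sketch only, not the Qwen2.5-Math runtime: the function names, the fenced `python` / `output` block convention, and the example draft are assumptions, and the code runs without any sandboxing, which a real system would need.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built programmatically so this example stays one clean block
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_python(code: str) -> str:
    """Execute code and return what it printed. NOT sandboxed: demo only."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def tool_integrated_step(model_output: str) -> str:
    """Run the first Python block in the model output and append its result."""
    match = CODE_BLOCK.search(model_output)
    if match is None:
        return model_output  # pure chain-of-thought, nothing to execute
    result = run_python(match.group(1))
    return f"{model_output}\n{FENCE}output\n{result}\n{FENCE}"


# Hypothetical model draft that delegates the arithmetic to the interpreter.
draft = (
    "To find the sum of the first 100 squares, compute it directly.\n"
    f"{FENCE}python\nprint(sum(k * k for k in range(1, 101)))\n{FENCE}"
)
augmented = tool_integrated_step(draft)
print(augmented)
```

In a full loop, the augmented text would be fed back to the model so it can continue reasoning from the tool's exact result instead of computing the value itself.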

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
