SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models

Dian Yu, Baolin Peng, Ye Tian, Linfeng Song, Haitao Mi, Dong Yu
2024-08-28
Abstract: There is a growing trend of teaching large language models (LLMs) to solve mathematical problems through coding. Existing studies primarily focus on prompting powerful, closed-source models to generate seed training data followed by in-domain data augmentation, equipping LLMs with considerable capabilities for code-aided mathematical reasoning. However, continually training these models on augmented data derived from a few datasets such as GSM8K may impair their generalization abilities and restrict their effectiveness to a narrow range of question types. Conversely, the potential of improving such LLMs by leveraging large-scale, expert-written, diverse math question-answer pairs remains unexplored. To utilize these resources and tackle unique challenges such as code response assessment, we propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. We also explore different alignment algorithms with self-generated instruction/preference data to foster continuous improvement. Experiments across both in-domain (up to +5.7%) and out-of-domain (+4.4%) benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the limited accuracy of large language models (LLMs) on mathematical problems, particularly in out-of-domain scenarios. Existing research mainly prompts powerful closed-source models to generate seed training data and then strengthens code-assisted mathematical reasoning through in-domain data augmentation. Continually training on data derived from a few datasets can weaken generalization and restrict the model to a narrow range of question types.

The paper proposes a new paradigm that uses large-scale, expert-written mathematical question-answer pairs to improve code-assisted mathematical reasoning. It consists of the following key steps:

1. **Constructing the Initial Model**: Fine-tune the LLM on high-quality seed data to obtain the initial model.
2. **Establishing a Multi-purpose Code Evaluation Model**: To make use of diverse mathematical Q&A data, build a code evaluation model (the code-based critic) that assesses whether the result of executing a code response is consistent with the reference answer.
3. **Code Data Generation**: Sample code responses from the current policy model and use the code evaluation model to keep the correct ones.
4. **Self-Improvement Mechanism**: Apply supervised fine-tuning (SFT) on previously unseen data, then further optimize the model with preference learning algorithms such as DPO.

Experiments show that this paradigm improves performance on both in-domain (up to +5.7%) and out-of-domain (+4.4%) mathematical benchmarks in English and Chinese. The approach also reduces the need for language-specific mathematical datasets, which gives it a degree of cross-language generalizability.
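The data-generation and self-improvement steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: every name here (`Sample`, `run_code`, `toy_critic`, `label_samples`, `build_training_data`) is hypothetical, the critic is replaced by a simple answer comparison where the paper uses a learned code-based critic model, and pairing every correct response with every incorrect one is an assumption about how preference data might be formed.

```python
import contextlib
import io
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    code: str
    output: str
    correct: bool


def run_code(code: str) -> str:
    """Execute a code response and return what it printed ("" on error).

    exec() keeps the sketch short; a real pipeline would sandbox execution.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception:
        return ""
    return buffer.getvalue().strip()


def toy_critic(question: str, output: str, reference: str) -> bool:
    """Stand-in for the learned critic (assumption): numeric match within a
    tolerance, falling back to exact string match."""
    try:
        return abs(float(output) - float(reference)) < 1e-6
    except ValueError:
        return output.strip() == reference.strip()


def label_samples(question, reference, code_samples, critic=toy_critic):
    """Run each sampled code response and let the critic judge its output."""
    labeled = []
    for code in code_samples:
        output = run_code(code)
        verdict = critic(question, output, reference)
        labeled.append(Sample(question, code, output, verdict))
    return labeled


def build_training_data(labeled):
    """Correct responses become SFT examples; each (correct, incorrect)
    pair becomes one preference example."""
    chosen = [s for s in labeled if s.correct]
    rejected = [s for s in labeled if not s.correct]
    sft = [{"prompt": s.question, "response": s.code} for s in chosen]
    prefs = [
        {"prompt": c.question, "chosen": c.code, "rejected": r.code}
        for c in chosen
        for r in rejected
    ]
    return sft, prefs


# Three hypothetical responses sampled from the policy for one question.
candidates = ["print(3 * 4 + 2)", "print(3 * 4 - 2)", "print(1 / 0)"]
labeled = label_samples("What is 3 * 4 + 2?", "14", candidates)
sft_data, preference_data = build_training_data(labeled)
```

In this toy run only the first candidate matches the reference answer, so it becomes the single SFT example and is paired against each of the two failing candidates. The `prompt`/`chosen`/`rejected` layout is a common convention for preference data, used here for illustration only.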