CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks

Tianlong Wang, Junzhe Chen, Xueting Han, Jing Bai
2024-10-01
Abstract: Post-training, particularly reinforcement learning (RL) using self-play-generated data, has become a new learning paradigm for large language models (LLMs). However, scaling RL to develop a general reasoner remains a research challenge, as existing methods focus on task-specific reasoning without adequately addressing generalization across a broader range of tasks. Moreover, unlike traditional RL with limited action space, LLMs operate in an infinite space, making it crucial to search for valuable and diverse strategies to solve problems effectively. To address this, we propose searching within the action space on high-level abstract plans to enhance model generalization and introduce Critical Plan Step Learning (CPL), comprising: 1) searching on plan, using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks, and 2) learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates advantage estimates for step preference obtained via MCTS into Direct Preference Optimization (DPO). This combination helps the model effectively learn critical plan steps, enhancing both reasoning capabilities and generalization. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as HumanEval (+12.2%), GPQA (+8.6%), ARC-C (+4.0%), MMLU-STEM (+2.2%), and BBH (+1.8%).
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to improve the generalization of large language models (LLMs) on reasoning tasks through reinforcement learning (RL). Specifically:

- **Research Challenges**: Existing methods have made significant progress on individual tasks or domains (such as mathematics or programming), but they do not adequately address generalization across different reasoning tasks.
- **Differences in Action Space**: Unlike traditional RL, which has a limited action space, LLMs operate in an infinite action space, making it crucial to search for valuable and diverse strategies.
- **Objective**: Propose a new method, Critical Plan Step Learning (CPL), which enhances the model's generalization ability by searching in the action space of high-level abstract plans. CPL consists of two parts:
  - Using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks.
  - Learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates the advantage estimates obtained from MCTS into Direct Preference Optimization (DPO).

Experimental results show that CPL, trained exclusively on GSM8K and MATH, improves performance on those datasets (GSM8K +10.5%, MATH +6.5%) and also on out-of-domain reasoning benchmarks (HumanEval +12.2%, GPQA +8.6%, ARC-C +4.0%, MMLU-STEM +2.2%, BBH +1.8%), demonstrating its effectiveness in enhancing both reasoning ability and generalization.
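
To make the Step-APO idea more concrete, the following is a minimal sketch of how advantage estimates could be folded into a DPO-style loss for a single pair of preferred and dispreferred plan steps. The summary above does not state the actual objective, so the margin form used here (subtracting the advantage gap inside the sigmoid) is an assumption for illustration and may differ from the paper's formulation. The function names, the `beta` default, and the scalar log-probability inputs are all hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_* are log-probabilities under the policy being trained,
    ref_logp_* are log-probabilities under the frozen reference model.
    """
    implicit_margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(implicit_margin)


def step_advantage_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                        adv_w, adv_l, beta=0.1):
    """Illustrative step-level variant (assumed form, not from the source).

    adv_w and adv_l stand for advantage estimates of the two plan steps,
    e.g. as produced by an MCTS value backup. A larger advantage gap
    demands a larger implicit reward margin before the loss gets small,
    so pairs that differ more in advantage receive a stronger signal.
    """
    implicit_margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    target_margin = adv_w - adv_l
    return -log_sigmoid(implicit_margin - target_margin)


if __name__ == "__main__":
    # Policy identical to the reference, so the implicit margin is zero.
    base = dpo_loss(-1.0, -1.0, -1.0, -1.0)
    critical = step_advantage_loss(-1.0, -1.0, -1.0, -1.0, adv_w=0.6, adv_l=0.1)
    print(f"DPO loss: {base:.4f}, advantage-weighted loss: {critical:.4f}")
```

With equal advantages the variant reduces to plain DPO; as the advantage gap between the two steps grows, the loss for that pair grows too, which is one way to read "learning critical plan steps".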