CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks

Tianlong Wang, Junzhe Chen, Xueting Han, Jing Bai

2024-10-01

Abstract: Post-training, particularly reinforcement learning (RL) using self-play-generated data, has become a new learning paradigm for large language models (LLMs). However, scaling RL to develop a general reasoner remains a research challenge, as existing methods focus on task-specific reasoning without adequately addressing generalization across a broader range of tasks. Moreover, unlike traditional RL with limited action space, LLMs operate in an infinite space, making it crucial to search for valuable and diverse strategies to solve problems effectively. To address this, we propose searching within the action space on high-level abstract plans to enhance model generalization and introduce Critical Plan Step Learning (CPL), comprising: 1) searching on plan, using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks, and 2) learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates advantage estimates for step preference obtained via MCTS into Direct Preference Optimization (DPO). This combination helps the model effectively learn critical plan steps, enhancing both reasoning capabilities and generalization. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as HumanEval (+12.2%), GPQA (+8.6%), ARC-C (+4.0%), MMLU-STEM (+2.2%), and BBH (+1.8%).

Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses the problem of improving the generalization ability of large language models (LLMs) in reasoning tasks through reinforcement learning (RL). Specifically:

- **Research Challenges**: Existing methods have made significant progress on reasoning, but they mainly target specific tasks or domains (such as mathematics or programming) and do not effectively address generalization across different reasoning tasks.
- **Differences in Action Space**: Unlike traditional RL settings with a limited action space, LLMs operate in an infinite action space, making it crucial to search for valuable and diverse strategies.
- **Objective**: Propose a new method, Critical Plan Step Learning (CPL), which enhances the model's generalization ability by searching in the action space of high-level abstract plans. CPL consists of two parts:
  - Using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks.
  - Learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates the advantage estimates obtained from MCTS into Direct Preference Optimization (DPO).

Experimental results show that CPL, trained only on GSM8K and MATH, improves performance on those datasets (GSM8K +10.5%, MATH +6.5%) and also on out-of-domain reasoning benchmarks (HumanEval +12.2%, GPQA +8.6%, ARC-C +4.0%, MMLU-STEM +2.2%, BBH +1.8%), demonstrating its effectiveness in enhancing both reasoning ability and generalization.
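To make the two parts of CPL more concrete, the following is a minimal sketch and not the authors' implementation. The abstract gives no formulas, so everything here is an assumption: the `PlanNode` structure, the UCT selection rule, the advantage defined as child value minus parent value, and the use of the advantage gap as a margin inside a DPO-style logistic loss are all illustrative choices. The exact Step-APO objective should be taken from the paper itself.

```python
import math
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """One plan step in a search tree (hypothetical structure, not from the paper)."""
    step: str
    value_sum: float = 0.0   # sum of rollout outcomes backed up through this node
    visits: int = 0
    children: list = field(default_factory=list)

    @property
    def value(self) -> float:
        # Mean backed-up outcome; 0.0 for a node that has never been visited.
        return self.value_sum / self.visits if self.visits else 0.0


def uct_select(parent: PlanNode, c: float = 1.4) -> PlanNode:
    """Pick the child with the highest UCT score (standard MCTS selection rule)."""
    def score(child: PlanNode) -> float:
        if child.visits == 0:
            return math.inf  # always try unvisited plan steps first
        explore = c * math.sqrt(math.log(parent.visits) / child.visits)
        return child.value + explore
    return max(parent.children, key=score)


def advantage(parent: PlanNode, child: PlanNode) -> float:
    """Assumed advantage estimate: value of the step minus value of its parent state."""
    return child.value - parent.value


def step_margin_loss(logratio_w: float, logratio_l: float,
                     adv_w: float, adv_l: float, beta: float = 0.1) -> float:
    """DPO-style loss for one preferred/dispreferred step pair.

    logratio_* is log(pi_theta / pi_ref) for the preferred (w) and dispreferred (l)
    step. The advantage gap is subtracted as a margin; this is an assumed form.
    """
    z = beta * (logratio_w - logratio_l) - (adv_w - adv_l)
    # Numerically stable -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

Under these assumptions, a pair of sibling plan steps with a large advantage gap demands a larger log-ratio gap from the policy before the loss becomes small, so steps that MCTS identifies as decisive carry more weight than pairs whose values are nearly equal.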
