Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing

Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty

2024-10-15

Abstract: Large Language Models (LLMs) have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns regarding the hallucination and flaws in their reasoning process. Substantial efforts are being made to improve the reliability and faithfulness of the generated rationales. Some approaches model reasoning as planning, while others focus on annotating for process supervision. Nevertheless, the planning-based search process often results in high latency due to the frequent assessment of intermediate reasoning states and the extensive exploration space. Additionally, supervising the reasoning process with human annotation is costly and challenging to scale for LLM training. To address these issues, in this paper, we propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories, which are ranked according to synthesized process rewards. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework, showing that our 7B model can surpass the strong counterparts like GPT-3.5-Turbo.

Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper aims to address the hallucinations and deficiencies in reasoning processes that large language models (LLMs) exhibit when handling complex reasoning tasks. Although LLMs show significant potential in generating step-by-step reasoning, the reasoning processes they generate are often misleading and inaccurate, especially in complex reasoning scenarios. Additionally, existing improvement methods, such as planning-based search and human process supervision, while effective, suffer from high latency and high costs.

Specifically, the paper attempts to solve the following issues:

1. **Hallucinations and Reasoning Deficiencies**: LLMs tend to produce misleading conclusions when generating reasoning processes, which is particularly evident in complex reasoning tasks.
2. **High Latency**: Planning-based search methods result in lengthy reasoning processes due to frequent evaluations of intermediate reasoning states and exploration of numerous possible paths.
3. **High Costs**: The cost of human annotation for process supervision is high and challenging to scale for LLM training.

To address these issues, the paper proposes a new framework that learns planning-based reasoning by collecting trajectories and using synthetic process rewards. This framework leverages offline simulation and trajectory collection to avoid the high latency of online planning and reduces reliance on human annotations. Experimental results show that this method achieves significant performance improvements in logical reasoning and mathematical reasoning tasks, outperforming existing baseline models.
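The idea of ranking collected trajectories by synthesized process rewards and then training with DPO can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how rewards are synthesized or how pairs are selected, so the function names `build_preference_pairs` and `dpo_loss`, the `margin` threshold, and the toy reward values are all assumptions made for illustration. Only the loss formula itself is the standard DPO objective for a single preference pair.

```python
import math
from itertools import combinations


def build_preference_pairs(trajectories, margin=0.0):
    """Turn (trajectory_text, process_reward) tuples into (chosen, rejected) pairs.

    Hypothetical helper: a pair is kept only when the two rewards differ
    by more than `margin` (an assumed filtering rule, not from the paper).
    """
    pairs = []
    for (text_a, r_a), (text_b, r_b) in combinations(trajectories, 2):
        if abs(r_a - r_b) <= margin:
            continue  # rewards too close to express a clear preference
        pairs.append((text_a, text_b) if r_a > r_b else (text_b, text_a))
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio difference))."""
    x = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy trajectories with made-up process rewards
trajectories = [("A", 0.9), ("B", 0.2), ("C", 0.85)]
pairs = build_preference_pairs(trajectories, margin=0.1)
```

In this toy run, trajectories A and C are too close in reward to form a pair, so only pairs that prefer A or C over B are kept. Because the pairs are built offline, no search or state evaluation is needed at inference time, which is the latency argument the summary makes.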


DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

A Human-Like Reasoning Framework for Multi-Phases Planning Task with Large Language Models

Reasoning with Language Model is Planning with World Model

Explicit Planning Helps Language Models in Logical Reasoning

Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop Visual Reasoning

Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency

Non-myopic Generation of Language Model for Reasoning and Planning

Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks

CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks

Cooperative Strategic Planning Enhances Reasoning Capabilities in Large Language Models

Non-myopic Generation of Language Models for Reasoning and Planning

Latent Plan Transformer for Trajectory Abstraction: Planning as Latent Space Inference

Reason for Future, Act for Now: A Principled Architecture for Autonomous LLM Agents

Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning

Thought-Like-Pro: Enhancing Reasoning of Large Language Models through Self-Driven Prolog-based Chain-of-Thought

Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning

Plan of Thoughts: Heuristic-Guided Problem Solving with Large Language Models

Language Model Non-myopic Generation for Reasoning and Planning

Learning to Plan by Updating Natural Language

Guiding Language Model Reasoning with Planning Tokens