Abstract: Enhancing the reasoning capabilities of large language models (LLMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making. Humans excel at these tasks by leveraging deliberate planning with an internal world model to simulate the potential outcomes of various actions. Inspired by this, we propose a novel multi-step reasoning framework for LLMs, referred to as Structure-aware Planning with Accurate World Model (SWAP). Unlike previous approaches that rely solely on Chain-of-Thought (CoT) reasoning in natural language, SWAP incorporates structural information to guide the reasoning process via a world model and provides a soft verification mechanism over the steps. Moreover, SWAP overcomes the challenge of accurate world state predictions in complex reasoning tasks by introducing a Generator-Discriminator architecture, which enables more reliable world modeling. Specifically, the generator predicts the next state, and the discriminator ensures alignment with the logical consistency required by the problem context. SWAP also encourages the policy model to explore a broad range of potential actions to prevent premature convergence. By resolving the bottlenecks of generation diversity for both actions and states using diversity-based modeling (DBM) and improving discrimination accuracy through contrastive ranking (CR), SWAP significantly enhances the reasoning performance of LLMs. We evaluate SWAP across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP achieves substantial improvements over the baselines and consistently outperforms existing LLMs of similar sizes.
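To make the Generator-Discriminator idea concrete, here is a minimal Python sketch of a single world-model step. It assumes the generator and discriminator are exposed as plain callables (`generate` and `judge` are hypothetical names; in SWAP both roles are played by LLMs). The generator proposes several candidate next states, and the discriminator compares them pairwise; the candidate with the most pairwise wins is kept. This round-robin tally is one simple way to realize relative (contrastive) comparison and is not necessarily the ranking procedure used in the paper.

```python
from itertools import combinations
from typing import Callable, List

# Hypothetical interfaces; in SWAP both roles are played by LLMs.
Generator = Callable[[str, str, int], List[str]]  # (state, action, k) -> k candidate next states
PairwiseJudge = Callable[[str, str, str], int]   # (context, a, b) -> 0 if a is better, else 1


def predict_next_state(state: str, action: str, generate: Generator,
                       judge: PairwiseJudge, k: int = 4) -> str:
    """Propose k candidate next states and keep the one that wins most pairwise comparisons."""
    candidates = generate(state, action, k)
    wins = [0] * len(candidates)
    context = f"{state}\n{action}"
    # Compare every pair of candidates instead of scoring each one in isolation.
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if judge(context, candidates[i], candidates[j]) == 0 else j
        wins[winner] += 1
    # max() returns the first maximal index, so ties favor the earliest candidate.
    best = max(range(len(candidates)), key=lambda idx: wins[idx])
    return candidates[best]


# Toy stand-ins: a generator that sometimes makes arithmetic slips and a
# judge that can check this one sum exactly.
def toy_generate(state: str, action: str, k: int) -> List[str]:
    return ["x = 5", "x = 6", "x = 5", "x = 7"][:k]


def toy_judge(context: str, a: str, b: str) -> int:
    def correct(s: str) -> bool:
        return s == "x = 5"
    return 0 if correct(a) or not correct(b) else 1


next_state = predict_next_state("x = 2 + 3", "evaluate the sum", toy_generate, toy_judge)
print(next_state)  # x = 5
```

Comparing candidates against each other gives the discriminator a reference point that an isolated score lacks; the trade-off is that the number of comparisons grows quadratically with `k`.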
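Generation diversity can be illustrated in the same spirit. The sketch below is an assumption-laden stand-in, not the paper's DBM method: a hypothetical `sample` callable receives the candidates kept so far (so a real model could condition on them and steer away from repeats), and near-duplicates are discarded using `difflib.SequenceMatcher` similarity with an arbitrarily chosen 0.9 threshold.

```python
from difflib import SequenceMatcher
from typing import Callable, List

# Hypothetical sampler: (prompt, candidates kept so far) -> one new candidate.
Sampler = Callable[[str, List[str]], str]


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Case- and whitespace-insensitive similarity check between two candidates."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio() >= threshold


def diverse_candidates(prompt: str, sample: Sampler, k: int, max_tries: int = 20) -> List[str]:
    """Collect up to k mutually distinct candidates within a fixed sampling budget."""
    kept: List[str] = []
    for _ in range(max_tries):
        if len(kept) >= k:
            break
        # Passing `kept` lets a real sampler condition on earlier outputs.
        candidate = sample(prompt, kept)
        if not any(is_near_duplicate(candidate, prev) for prev in kept):
            kept.append(candidate)
    return kept


# Toy sampler that replays a fixed pool containing a near-duplicate.
pool = iter(["Use substitution.", "use  substitution.", "Factor the quadratic.", "Complete the square."])


def toy_sample(prompt: str, previous: List[str]) -> str:
    return next(pool)


result = diverse_candidates("Solve x^2 - 5x + 6 = 0", toy_sample, k=3)
print(result)
```

The second pool entry differs from the first only in case and spacing, so it is rejected and the three kept candidates are the genuinely different approaches.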

What problem does this paper attempt to address?

This paper addresses the weak performance of large language models (LLMs) on complex reasoning tasks. Although LLMs have made significant progress in many fields, they remain limited on tasks that require multi-step decision-making. Humans excel at such tasks because they plan deliberately, using an internal world model to simulate the potential outcomes of different actions. Inspired by this, the paper proposes a new multi-step reasoning framework, Structure-aware Planning with Accurate World Model (SWAP), to enhance the reasoning ability of LLMs.

### Main Contributions:

1. **Structure-aware Planning**: SWAP introduces entailment graphs that make explicit how preconditions lead to intermediate conclusions and how the final answer is verified, improving the coherence of the reasoning process and supporting logical verification.
2. **Accurate World Model**: A Generator-Discriminator architecture yields a more accurate world model by addressing the bottlenecks of generation diversity and discrimination accuracy, thereby improving reasoning performance.
3. **Extensive Experimental Verification**: SWAP shows significant improvements across a variety of reasoning benchmarks covering mathematical reasoning, logical reasoning, and coding tasks, and consistently outperforms existing LLMs of similar size.

### Method Overview:

- **Task Modeling**: Complex reasoning tasks are modeled as Markov decision processes (MDPs). States represent the information currently known or inferred; actions derive new information from the current state; transition probabilities give the probability of moving to each next state after an action is taken; and a scoring function quantifies the quality of an action in the current state.
- **Structured Reasoning**: An entailment graph is constructed to represent how preconditions lead to intermediate conclusions and, ultimately, to verification of the final answer. This helps the model make better-informed decisions during reasoning.
- **Diversity Generation**: Diversity-based Modeling (DBM) encourages the generator to produce distinct solutions, avoiding repetition and self-bias, so that a wider range of valid paths is explored.
- **Discrimination Accuracy Improvement**: Contrastive Ranking (CR) has the discriminator compare candidate solutions against each other rather than judge each one in isolation, which makes error-prone parts easier to identify and improves discrimination accuracy.

### Experimental Results:

- **Overall Performance**: SWAP performs strongly across multiple benchmarks, especially on mathematical reasoning (MATH) and math word problems (GSM8K), where it improves the accuracy of the baseline model LLaMA3-8B-Instruct by 14.7% and 10.3%, respectively.
- **Influence of Search Tree Width and Depth**: Increasing the width of the search tree improves accuracy up to a point, after which returns diminish. For example, on FOLIO and GSM8K, the gains taper off once the number of search attempts exceeds 5 to 7.

In conclusion, by introducing structure-aware planning and an accurate world model, this paper significantly improves the performance of LLMs on complex reasoning tasks and points to a new direction for future research.
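The MDP view from the Task Modeling step can be sketched as a width-limited beam search. In this toy, which is not SWAP's planner, states are integers, a hypothetical policy proposes two arithmetic actions, a deterministic function stands in for the world model, and the scoring function rates each action by how close its predicted next state lands to a target. The `width` parameter plays the role of search-tree width; `ToyMDP` and `beam_plan` are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

State = int   # toy state: the number derived so far
Action = str  # toy action: the name of an arithmetic step


@dataclass
class ToyMDP:
    """Hypothetical stand-ins for the MDP components in the Task Modeling step."""
    actions: Callable[[State], List[Action]]      # policy: candidate actions in a state
    transition: Callable[[State, Action], State]  # world model: predicted next state
    score: Callable[[State, Action], float]       # scoring function: quality of an action
    is_goal: Callable[[State], bool]


def beam_plan(mdp: ToyMDP, start: State, width: int, depth: int) -> Optional[List[Action]]:
    """Keep the `width` best partial plans for up to `depth` steps; return the first plan reaching a goal."""
    beam = [([], start)]
    for _ in range(depth):
        scored = []
        for plan, state in beam:
            for action in mdp.actions(state):
                scored.append((mdp.score(state, action), plan + [action], mdp.transition(state, action)))
        # Rank by the score of the action just taken and keep the top `width`.
        scored.sort(key=lambda item: item[0], reverse=True)
        beam = [(plan, nxt) for _, plan, nxt in scored[:width]]
        for plan, state in beam:
            if mdp.is_goal(state):
                return plan
    return None


TARGET = 10
STEPS = {"+3": lambda s: s + 3, "x2": lambda s: s * 2}
mdp = ToyMDP(
    actions=lambda s: list(STEPS),
    transition=lambda s, a: STEPS[a](s),
    score=lambda s, a: -abs(TARGET - STEPS[a](s)),
    is_goal=lambda s: s == TARGET,
)

print(beam_plan(mdp, start=1, width=1, depth=4))  # None
print(beam_plan(mdp, start=1, width=2, depth=4))  # ['+3', '+3', '+3']
```

With width 1 the greedy path overshoots (1, 4, 8, 11, 14) and no plan is found, while width 2 keeps the runner-up state 7 alive and reaches 10 in three steps.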
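An entailment graph can be represented as statements linked by derivation steps. The following sketch assumes exact string matching where SWAP would rely on the LLM's judgment: it forward-chains from the premises and treats an answer as verified only if every step leading to it has all of its preconditions established. The class `EntailmentGraph` and its methods are illustrative, not an API from the paper.

```python
from typing import FrozenSet, List, Set, Tuple


class EntailmentGraph:
    """Toy entailment graph: statements are nodes, and each derivation step
    links a set of precondition statements to one conclusion."""

    def __init__(self) -> None:
        self.premises: Set[str] = set()
        self.steps: List[Tuple[FrozenSet[str], str]] = []

    def add_premise(self, statement: str) -> None:
        self.premises.add(statement)

    def add_step(self, preconditions: List[str], conclusion: str) -> None:
        self.steps.append((frozenset(preconditions), conclusion))

    def established(self) -> Set[str]:
        """Forward-chain from the premises until no new conclusion can be added."""
        known = set(self.premises)
        changed = True
        while changed:
            changed = False
            for preconditions, conclusion in self.steps:
                # A conclusion is established only when all of its preconditions are.
                if conclusion not in known and preconditions <= known:
                    known.add(conclusion)
                    changed = True
        return known

    def supports(self, answer: str) -> bool:
        return answer in self.established()


g = EntailmentGraph()
g.add_premise("a = 2")
g.add_premise("b = 3")
g.add_step(["a = 2", "b = 3"], "a + b = 5")
g.add_step(["a + b = 5"], "answer: 5")
g.add_step(["c = 4", "a + b = 5"], "answer: 9")  # depends on an unstated premise
print(g.supports("answer: 5"), g.supports("answer: 9"))  # True False
```

Because "answer: 9" rests on a precondition that no premise or earlier step provides, the graph exposes it as unsupported even though the step itself was recorded.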
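Why extra width eventually stops paying off can be seen from an idealized model that is not taken from the paper: if each search attempt independently succeeds with probability p and a perfect verifier always recognizes a correct attempt, then n attempts succeed with probability 1 - (1 - p)^n, and the n-th attempt adds only p(1 - p)^(n - 1). The value p = 0.3 below is arbitrary and produces no experimental result.

```python
def success_at_width(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts is correct: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra success probability contributed by the n-th attempt, equal to p * (1 - p)^(n - 1)."""
    return success_at_width(p, n) - success_at_width(p, n - 1)


p = 0.3  # hypothetical per-attempt success rate, chosen only for illustration
for n in range(1, 9):
    print(f"n={n}  success={success_at_width(p, n):.3f}  gain={marginal_gain(p, n):.3f}")
```

Each added attempt contributes geometrically less. In practice attempts are correlated and the discriminator is imperfect, so this curve is only a qualitative guide to the diminishing returns reported above.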

Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate World Model
