Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate World Model

Siheng Xiong, Ali Payani, Yuan Yang, Faramarz Fekri
2024-10-04
Abstract: Enhancing the reasoning capabilities of large language models (LLMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making. Humans excel at these tasks by leveraging deliberate planning with an internal world model to simulate the potential outcomes of various actions. Inspired by this, we propose a novel multi-step reasoning framework for LLMs, referred to as Structure-aware Planning with Accurate World Model (SWAP). Unlike previous approaches that rely solely on Chain-of-Thought (CoT) reasoning in natural language, SWAP incorporates structural information to guide the reasoning process via a world model and provides a soft verification mechanism over the steps. Moreover, SWAP overcomes the challenge of accurate world state predictions in complex reasoning tasks by introducing a Generator-Discriminator architecture, which enables more reliable world modeling. Specifically, the generator predicts the next state, and the discriminator ensures alignment with the logical consistency required by the problem context. SWAP also encourages the policy model to explore a broad range of potential actions to prevent premature convergence. By resolving the bottlenecks of generation diversity for both actions and states using diversity-based modeling (DBM) and improving discrimination accuracy through contrastive ranking (CR), SWAP significantly enhances the reasoning performance of LLMs. We evaluate SWAP across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP achieves substantial improvements over the baselines and consistently outperforms existing LLMs of similar sizes.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the weak performance of large language models (LLMs) on complex reasoning tasks. Although LLMs have made significant progress in many fields, they remain limited on tasks that require multi-step decision-making. Humans handle such tasks well by planning deliberately with an internal world model that simulates the potential outcomes of different actions. Inspired by this, the paper proposes a multi-step reasoning framework, Structure-aware Planning with Accurate World Model (SWAP), to enhance the reasoning ability of LLMs.

### Main Contributions:

1. **Structure-aware Planning**: SWAP introduces entailment graphs that make explicit how preconditions lead to intermediate conclusions and how the final answer is verified, which improves the coherence of the reasoning process and supports logical verification.
2. **Accurate World Model**: A Generator-Discriminator architecture yields a more accurate world model by addressing two bottlenecks, generation diversity and discrimination accuracy, thereby improving reasoning performance.
3. **Extensive Experimental Verification**: Across a variety of reasoning benchmarks covering mathematical reasoning, logical reasoning, and coding tasks, SWAP shows substantial improvements over the baselines and outperforms existing LLMs of similar sizes.

### Method Overview:

- **Task Modeling**: Complex reasoning tasks are modeled as Markov decision processes (MDPs). States represent the information currently known or inferred, actions derive new information from the current state, transition probabilities give the probability of reaching each next state after an action, and a scoring function quantifies the quality of an action in the current state.
- **Structured Reasoning**: An entailment graph is constructed to represent how preconditions lead to intermediate conclusions and ultimately verify the correctness of the final answer. This helps the model make more informed decisions during reasoning.
- **Diversity Generation**: Diversity-based Modeling (DBM) encourages the generator to produce distinct solutions, avoiding repetition and self-bias and thereby exploring a wider range of valid paths.
- **Discrimination Accuracy Improvement**: Contrastive Ranking (CR) has the discriminator compare candidate solutions against one another rather than score each in isolation, which improves its accuracy and simplifies the task of identifying error-prone parts.

### Experimental Results:

- **Overall Performance**: SWAP performs well on multiple benchmarks, especially the mathematical reasoning (MATH) and math word problem (GSM8K) datasets, where it increases the accuracy of the baseline model LLaMA3-8B-Instruct by 14.7% and 10.3% respectively.
- **Influence of Search Tree Width and Depth**: Increasing the width of the search tree improves accuracy up to a point, after which the returns diminish. For example, on the FOLIO and GSM8K datasets, the gains gradually decrease once the number of search attempts exceeds 5 to 7.

In conclusion, this paper improves the performance of LLMs on complex reasoning tasks by introducing structure-aware planning and an accurate world model, providing a new direction for future research.
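To make the method overview more concrete, the following is a minimal toy sketch of the MDP view: states are sets of known facts, actions apply a rule to derive a new fact, a transition function produces the next state, and candidates are ranked relative to one another before a width-limited search keeps the best ones. It is not the paper's implementation. In SWAP the generator, world model, and discriminator are LLM-based, whereas here they are replaced by hand-written stand-ins, and every name (`Rule`, `propose_actions`, `transition`, `rank`, `search`, `goal_support`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical inference rule: if all premises hold, add the conclusion."""
    premises: frozenset
    conclusion: str


def propose_actions(state, rules):
    # Toy stand-in for the generator: every applicable rule that adds a new fact.
    return [r for r in rules if r.premises <= state and r.conclusion not in state]


def transition(state, rule):
    # Toy stand-in for the world model: a deterministic next state.
    return state | {rule.conclusion}


def rank(candidates, goal_support):
    # Toy stand-in for the discriminator: order candidates relative to each
    # other by how many goal-relevant facts their state contains (an assumed
    # heuristic, not the trained contrastive ranking from the paper).
    return sorted(candidates, key=lambda c: len(c[0] & goal_support), reverse=True)


def search(initial, rules, goal, goal_support, width=2, max_depth=5):
    # Each beam entry is (state, edges); edges record the entailment graph
    # as (premises, conclusion) pairs in derivation order.
    beam = [(frozenset(initial), [])]
    for _ in range(max_depth):
        for state, edges in beam:
            if goal in state:
                return state, edges
        candidates = []
        for state, edges in beam:
            for rule in propose_actions(state, rules):
                edge = (tuple(sorted(rule.premises)), rule.conclusion)
                candidates.append((transition(state, rule), edges + [edge]))
        if not candidates:
            break
        beam = rank(candidates, goal_support)[:width]  # keep the top `width`
    for state, edges in beam:
        if goal in state:
            return state, edges
    return None


rules = [
    Rule(frozenset({"a"}), "c"),
    Rule(frozenset({"b"}), "d"),
    Rule(frozenset({"c", "d"}), "e"),
    Rule(frozenset({"a"}), "x"),  # distractor that never helps reach the goal
]
result = search({"a", "b"}, rules, goal="e", goal_support={"c", "d", "e"})
final_state, entailment_edges = result
for premises, conclusion in entailment_edges:
    print(premises, "->", conclusion)
```

The `width` parameter plays the role of the search tree width discussed in the experimental results: a larger value keeps more candidate states per step at a higher cost. The recorded edges give a simple picture of how an entailment graph links preconditions to intermediate conclusions and finally to the goal.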