Abstract: Despite the remarkable success of large language models (LLMs) on traditional natural language processing tasks, their planning ability remains a critical bottleneck in tackling complex multi-step reasoning tasks. Existing approaches mainly rely on prompting or task-specific fine-tuning, often suffering from poor robustness and cross-task generalization. To address this limitation, we introduce CodePlan, a scalable framework that empowers LLMs to generate and follow code-form plans: pseudocode that outlines high-level, structured reasoning processes. By leveraging the structured and versatile nature of code, CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks. Importantly, CodePlan allows automatic extraction of code-form plans from massive, wide-ranging text corpora without the need for curated, task-specific datasets. This enables it to scale up efficiently and improve LLMs' reasoning capabilities across diverse scenarios. To train CodePlan, we construct a large-scale dataset of 2M examples that integrate code-form plans with standard prompt-response pairs from existing corpora. With minimal computation overhead during both training and inference, CodePlan achieves a 25.1% relative improvement compared with directly generating responses, averaged across 13 challenging multi-step reasoning benchmarks, spanning mathematical reasoning, symbolic reasoning, instruction-following, multi-hop QA, and decision-making tasks. Further analysis reveals CodePlan's increasing performance gains on more complex reasoning tasks, as well as significant data efficiency thanks to its generalization ability.
What problem does this paper attempt to address?
The paper addresses the weak planning ability of large language models (LLMs) on complex multi-step reasoning tasks. Although LLMs perform well on traditional natural language processing tasks, planning remains a critical bottleneck when a task requires multi-step reasoning. Existing methods rely mainly on prompting or task-specific fine-tuning, and they often suffer from poor robustness and weak cross-task generalization. To overcome this limitation, the authors propose CodePlan, a scalable framework that enables LLMs to generate and follow code-form plans: pseudocode that outlines high-level, structured reasoning processes. By leveraging the structured and versatile nature of code, CodePlan captures the rich semantics and control flow of complex reasoning tasks. CodePlan also allows code-form plans to be extracted automatically from large, wide-ranging text corpora without curated, task-specific datasets, which lets it scale efficiently and improve LLM reasoning across diverse scenarios.
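To make "generate and follow a code-form plan" concrete, here is a minimal sketch; it is not the authors' implementation. The example plan, the prompt templates, and the `generate` callable are all hypothetical, and a stub stands in for the LLM. The summary does not say whether the plan and the response come from one continuous generation or from separate calls; two calls are assumed here for clarity.

```python
from typing import Callable, Dict

# Hypothetical code-form plan for a multi-hop question. The helper names inside
# the plan are illustrative pseudocode; the paper's plan format may differ.
EXAMPLE_PLAN = '''def answer(question):
    entity = find_bridge_entity(question)
    fact = look_up(entity)
    return compose_answer(question, fact)
'''


def plan_then_respond(prompt: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Two-stage generation: produce a plan, then a response conditioned on it."""
    # Stage 1: ask the model for a pseudocode plan (template is an assumption).
    plan = generate(f"Prompt:\n{prompt}\n\nWrite a pseudocode plan:\n")
    # Stage 2: ask the model to respond while following that plan.
    response = generate(
        f"Prompt:\n{prompt}\n\nPlan:\n{plan}\n\nFollow the plan and respond:\n"
    )
    return {"plan": plan, "response": response}


def stub_model(text: str) -> str:
    """Stand-in for an LLM call (assumption): returns canned text."""
    if text.rstrip().endswith("Write a pseudocode plan:"):
        return EXAMPLE_PLAN
    return "stub response that follows the plan"
```

Calling `plan_then_respond("some question", stub_model)` returns a dictionary holding the plan and the plan-conditioned response. Replacing `stub_model` with a real model client is the only change a working pipeline would need under these assumptions.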
Specifically, the main contributions of the paper include:
1. **Introduction of CodePlan**: A new, scalable framework that enables LLMs to generate and follow code-form plans, that is, pseudocode outlining high-level, structured reasoning processes. According to the authors, explicit plans let LLMs go beyond the planning signals that are only implicit in natural-language text.
2. **Efficient training-data construction**: CodePlan allows training data to be constructed efficiently and at low cost from large, wide-ranging corpora. The authors build a dataset of 2 million examples that integrate standard prompt-response pairs with code-form plans.
3. **Verification on multiple models**: The paper verifies the effectiveness and generality of CodePlan on multiple backbone models, including models from the Mistral and Llama series. Compared with directly generating responses, CodePlan achieves a 25.1% relative improvement averaged across 13 challenging multi-step reasoning benchmarks, which cover mathematical reasoning, symbolic reasoning, instruction following, multi-hop question answering, and decision-making. Further analysis shows that CodePlan's advantage grows as problem complexity increases and that the method is highly data-efficient.
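The data construction described in the second contribution can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `PlanExample` record, the section markers in `to_training_text`, the empty-plan filter, and the `extract_plan` callable are all hypothetical, and a stub replaces whatever automatic extraction the authors actually use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class PlanExample:
    """One training record: a prompt-response pair plus its code-form plan."""
    prompt: str
    plan: str
    response: str

    def to_training_text(self) -> str:
        # Hypothetical serialization; the real template is not given in the summary.
        return (
            f"### Prompt\n{self.prompt}\n"
            f"### Plan\n{self.plan}\n"
            f"### Response\n{self.response}"
        )


def build_dataset(
    pairs: Iterable[Tuple[str, str]],
    extract_plan: Callable[[str, str], str],
) -> List[PlanExample]:
    """Attach an extracted plan to each existing prompt-response pair."""
    examples = []
    for prompt, response in pairs:
        plan = extract_plan(prompt, response).strip()
        if not plan:
            continue  # assumed filter: drop pairs for which no plan was extracted
        examples.append(PlanExample(prompt, plan, response))
    return examples


def stub_extractor(prompt: str, response: str) -> str:
    """Stand-in for automatic plan extraction (assumption)."""
    if not response:
        return ""
    return "def solve():\n    steps = parse(prompt)\n    return combine(steps)"
```

Because the plan is derived from pairs that already exist, no task-specific annotation is needed in this sketch, which is the property the summary credits for the method's low data-construction cost.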