Abstract: Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thought prompting (CoT) is by far the state-of-the-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step "thought" process. To disentangle computation from reasoning, we propose "Program of Thoughts" (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated programs to derive the answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA) in both few-shot and zero-shot setups. Under both settings, PoT shows an average performance gain over CoT of around 12% across all the evaluated datasets. By combining PoT with self-consistency decoding, we achieve SoTA performance on all math problem datasets and near-SoTA performance on the financial datasets. All of our data and code are released on GitHub: https://github.com/wenhuchen/Program-of-Thoughts
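As a rough illustration of the idea in the abstract, the sketch below separates the two roles: a program string stands in for what the language model would write, and the Python interpreter performs the arithmetic. The program text, the `run_program` helper, and the `ans` variable convention are assumptions made for this example; they are not taken from the paper's released code.

```python
# Hypothetical stand-in for a program a language model might generate for:
# "A deposit of 20000 earns 5% interest per year, compounded annually.
#  What is the balance after 30 years?"
GENERATED_PROGRAM = """
principal = 20000
rate = 0.05
years = 30
balance = principal
for _ in range(years):  # iteration written as a loop, not 30 prose steps
    balance = balance * (1 + rate)
ans = round(balance, 2)
"""


def run_program(source: str, answer_var: str = "ans"):
    """Execute generated code and return the value bound to answer_var.

    Illustrative helper only. Model output is untrusted, so a real system
    would run this inside a sandbox rather than calling exec directly.
    """
    namespace = {}
    exec(source, namespace)  # the interpreter does the computation
    return namespace[answer_var]


answer = run_program(GENERATED_PROGRAM)
print(answer)
```

The model only has to get the reasoning right (which quantities to combine and how); the 30 multiplications are carried out exactly by the interpreter.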

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the limitations of current language models on numerical reasoning tasks. While Chain of Thoughts (CoT) prompting performs well on many such tasks, it has the following issues:

1. **Arithmetic Calculation Errors**: Large Language Models (LLMs) are prone to arithmetic errors, especially when dealing with large numbers.
2. **Inadequate Ability to Solve Complex Mathematical Expressions**: LLMs struggle to solve complex mathematical expressions, such as polynomial equations or differential equations.
3. **Inefficiency in Expressing Iterations**: LLMs are very inefficient at expressing iteration when a large number of iterative steps is required.

To overcome these issues, the paper proposes the "Program of Thoughts" (PoT) method. PoT delegates the computational steps to an external programming language interpreter (such as a Python interpreter), separating complex calculations from reasoning and language understanding. The language model only needs to generate code that describes the reasoning process, while the calculations themselves are performed by the interpreter. This improves the accuracy of numerical reasoning and also extends what the model can express.

### Main Contributions

1. **Proposing the PoT Method**: Delegating computational steps to an external interpreter to improve the accuracy and efficiency of numerical reasoning tasks.
2. **Performance Evaluation**: Extensive experiments on multiple mathematical problem datasets and financial question-answering datasets show that PoT significantly outperforms CoT in both few-shot and zero-shot settings.
3. **Combining with Self-Consistency Decoding**: Further improving PoT's performance by combining it with Self-Consistency (SC) decoding.
4. **Ablation Study**: Conducting detailed ablation studies to explore the impact of different factors on PoT's performance, including different backend models and sensitivity to the choice of examples.

### Experimental Results

- **Few-Shot Setting**: PoT shows an average performance improvement of about 12% on mathematical problem datasets and about 15% on financial datasets.
- **Zero-Shot Setting**: PoT significantly outperforms CoT on all evaluated datasets, with an average performance improvement of about 12%.
- **Self-Consistency Decoding**: After combining with self-consistency decoding, PoT's performance on mathematical problem datasets further improves, with an average performance improvement of about 20%.

### Conclusion

The PoT method proposed in the paper addresses the limitations of CoT in numerical reasoning tasks by delegating computational steps to an external interpreter. Experimental results show that PoT significantly outperforms CoT in both few-shot and zero-shot settings, and that its performance improves further when combined with self-consistency decoding. These results indicate the potential of PoT for numerical reasoning tasks.
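The combination of PoT with self-consistency decoding described above can be sketched as follows: several candidate programs are executed, and the most frequent result is taken as the final answer. This is a minimal sketch under stated assumptions. The sampled programs are hard-coded stand-ins for model output, and `execute` and `self_consistent_answer` are hypothetical helper names, not the paper's implementation.

```python
from collections import Counter

# Hypothetical stand-ins for programs sampled from a language model for:
# "12 boxes hold 7 pens each, plus 5 loose pens. How many pens in total?"
SAMPLED_PROGRAMS = [
    "ans = 12 * 7 + 5",
    "ans = 5 + 7 * 12",
    "ans = 12 * (7 + 5)",  # a sample with faulty reasoning
    "ans = 1 / 0",         # a sample that crashes when executed
]


def execute(source):
    """Run one candidate program; return its 'ans' value, or None on failure."""
    namespace = {}
    try:
        exec(source, namespace)  # untrusted code: sandbox this in practice
    except Exception:
        return None
    return namespace.get("ans")


def self_consistent_answer(programs):
    """Majority vote over the results of programs that executed successfully."""
    results = [execute(p) for p in programs]
    votes = Counter(r for r in results if r is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


final_answer = self_consistent_answer(SAMPLED_PROGRAMS)
print(final_answer)
```

Voting on executed results rather than on program text means that differently written programs still count as agreeing when they compute the same value.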

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Design of Chain-of-Thought in Math Problem Solving

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Markov Chain of Thought for Efficient Mathematical Reasoning

Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning

Towards understanding chain-of-thought prompting: An empirical study of what matters

Automatic prompt augmentation and selection with chain-of-thought from labeled data

When Do Program-of-Thoughts Work for Reasoning?

Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning

MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting

Hint of Thought prompting: an explainable and zero-shot approach to reasoning tasks with LLMs

Leveraging Training Data in Few-Shot Prompting for Numerical Reasoning

Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems

Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models

Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts