Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, William W. Cohen
2023-10-23
Abstract: Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is by far the state-of-the-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step 'thought' process. To disentangle computation from reasoning, we propose 'Program of Thoughts' (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated programs to derive the answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA) for both few-shot and zero-shot setups. Under both few-shot and zero-shot settings, PoT shows an average performance gain over CoT of around 12% across all the evaluated datasets. By combining PoT with self-consistency decoding, we can achieve SoTA performance on all math problem datasets and near-SoTA performance on financial datasets. All of our data and code are released on GitHub: https://github.com/wenhuchen/Program-of-Thoughts
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the limitations of current language models on numerical reasoning tasks. Although traditional Chain of Thoughts (CoT) methods perform well on many numerical reasoning tasks, they still have the following issues:

1. **Arithmetic Calculation Errors**: Large Language Models (LLMs) are prone to arithmetic errors, especially when dealing with large numbers.
2. **Inadequate Ability to Solve Complex Mathematical Expressions**: LLMs cannot solve complex mathematical expressions, such as polynomial equations or differential equations.
3. **Inefficiency in Expressing Iterations**: LLMs are very inefficient at expressing iteration when a large number of iterative steps are required.

To overcome these issues, the paper proposes the "Program of Thoughts" (PoT) method. PoT delegates the computational steps to an external programming language interpreter (such as a Python interpreter), separating complex calculations from reasoning and language understanding. The language model only needs to generate code that describes the reasoning process, while the actual calculations are performed by the interpreter. This approach not only improves the accuracy of numerical reasoning but also enhances the model's expressive capabilities.

### Main Contributions

1. **Proposing the PoT Method**: Delegating computational steps to an external interpreter to improve the accuracy and efficiency of numerical reasoning tasks.
2. **Performance Evaluation**: Extensive experiments on multiple mathematical problem datasets and financial question-answering datasets show that PoT significantly outperforms CoT in both few-shot and zero-shot settings.
3. **Combining with Self-Consistency Decoding**: Further improving PoT's performance by combining it with Self-Consistency (SC) decoding.
4. **Ablation Study**: Conducting detailed ablation studies to explore the impact of different factors on PoT's performance, including different backend models and example sensitivity.

### Experimental Results

- **Few-Shot Setting**: PoT shows an average performance improvement of about 12% on mathematical problem datasets and about 15% on financial datasets.
- **Zero-Shot Setting**: PoT significantly outperforms CoT on all evaluated datasets, with an average performance improvement of about 12%.
- **Self-Consistency Decoding**: After combining with self-consistency decoding, PoT's performance on mathematical problem datasets further improves, with an average performance improvement of about 20%.

### Conclusion

The PoT method proposed in the paper effectively addresses the limitations of traditional CoT methods in numerical reasoning tasks by delegating computational steps to an external interpreter. Experimental results show that PoT significantly outperforms CoT in both few-shot and zero-shot settings, and its performance is further enhanced when combined with self-consistency decoding. These results demonstrate the great potential of PoT in numerical reasoning tasks.
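
To make the idea concrete, here is a minimal sketch of the PoT execution step combined with a majority vote in the spirit of self-consistency decoding. It is not the authors' implementation. The question, the three "sampled" programs, the helper names `run_program` and `self_consistent_answer`, and the convention of storing the result in a variable called `ans` are all assumptions made for illustration. The program strings are hand-written stand-ins for model output, and no language model is called.

```python
from collections import Counter

# Hypothetical question (made up for this sketch):
# "You deposit 1000 at 5% annual interest, compounded yearly.
#  What is the balance after 10 years, rounded to cents?"
#
# Hand-written stand-ins for programs a language model might generate.
# In PoT the model writes the program and the interpreter does the arithmetic.
SAMPLED_PROGRAMS = [
    # Iterative formulation: the loop is executed by Python, not by the model.
    "balance = 1000.0\nfor _ in range(10):\n    balance *= 1.05\nans = round(balance, 2)",
    # Closed-form formulation of the same reasoning.
    "ans = round(1000.0 * 1.05 ** 10, 2)",
    # A flawed sample that mistakenly uses simple interest.
    "ans = round(1000.0 * (1 + 0.05 * 10), 2)",
]


def run_program(source, answer_var="ans"):
    """Execute program text and return the value bound to `answer_var`.

    Returns None if the program raises or never sets the variable.
    WARNING: exec() on untrusted model output is unsafe; a real system
    would need to run this inside a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
    except Exception:
        return None
    return namespace.get(answer_var)


def self_consistent_answer(programs):
    """Run every sampled program and return the most common valid answer."""
    results = [run_program(p) for p in programs]
    valid = [r for r in results if r is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    for program in SAMPLED_PROGRAMS:
        print(run_program(program))
    print("voted answer:", self_consistent_answer(SAMPLED_PROGRAMS))
```

In this sketch the two correct programs agree, so the vote discards the flawed sample. The same pattern also shows why iteration is cheap in PoT: the ten compounding steps cost the model one `for` loop rather than ten written-out arithmetic steps.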