Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Libo Qin, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che
Abstract: Program of Thoughts (PoT) is an approach characterized by its executable intermediate steps, which ensure the accuracy of the logical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying solely on a single language may result in suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive experiments on the programming languages used in PoT and find that no single language consistently delivers optimal performance across all tasks and models. The effectiveness of each language varies depending on the specific scenarios. Inspired by this, we propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages. Experimental results reveal that it significantly outperforms Python Self-Consistency. Furthermore, it achieves comparable or superior performance compared to the best monolingual PoT in almost all tasks across all models. In particular, MultiPoT achieves more than 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve
This paper asks whether Python is always the best choice for the Program of Thoughts (PoT) approach. Most current PoT research uses Python alone, and the authors argue that this may yield suboptimal solutions and overlook the potential advantages of other programming languages.
### Background and Motivation
1. **Current State of PoT Method**:
- PoT expresses intermediate reasoning steps as executable code, which ensures the accuracy of the logical calculations. It is widely used in code generation, image reasoning, financial Q&A, and robot control.
- Currently, PoT mainly uses Python, but this reliance on a single language may not be the best choice for all tasks and models.
2. **Advantages of Multiple Languages**:
- Different programming languages perform differently depending on the task and the model.
- A multi-language approach can leverage the unique advantages and diversity of various languages, improving overall performance.
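To make "intermediate executable steps" concrete, here is a minimal sketch of what a PoT-style response could look like in Python. The question, the `POT_PROGRAM` string, and the `run_pot` helper are all hypothetical illustrations and are not taken from the paper; the point is only that the arithmetic is delegated to an interpreter instead of being done in free text.

```python
# Hypothetical question (not from the paper):
# "Pens cost $3 each. Ana buys 4 pens and pays with a $20 bill.
#  How much change does she get?"

# A PoT-style response: the reasoning is written as a program
# instead of free-text arithmetic.
POT_PROGRAM = """
def solve():
    price_per_pen = 3
    pens_bought = 4
    amount_paid = 20
    total_cost = price_per_pen * pens_bought
    change = amount_paid - total_cost
    return change
"""


def run_pot(program_source: str):
    """Execute a generated program and return the result of its solve() function."""
    namespace = {}
    # Assumption: the program defines solve(). A real pipeline should
    # run untrusted generated code in a sandbox, not with a bare exec().
    exec(program_source, namespace)
    return namespace["solve"]()


print(run_pot(POT_PROGRAM))  # prints 8
```

The same reasoning could equally be emitted as JavaScript, Java, C++, or R and run with the corresponding interpreter or compiler, which is the degree of freedom the paper explores.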
### Main Contributions
1. **Experimental Validation**:
- The authors conducted comprehensive experiments comparing the performance of different programming languages across various tasks and models, finding that no single language is the best choice in all cases.
- Experimental results show that Python performs poorly in some tasks and models, while other languages like R and JavaScript perform better in specific scenarios.
2. **Proposing Multi-Language PoT (MultiPoT)**:
- MultiPoT is a task- and model-agnostic method that generates PoTs in multiple languages for the same question and integrates their results through a voting mechanism.
- Experimental results show that MultiPoT significantly outperforms the Python self-consistency method and matches or exceeds the performance of the best single-language PoT in almost all tasks and models.
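The voting step described above can be sketched as a simple majority vote over the answers produced by each language's program. This is an illustrative sketch, not the authors' implementation: the function name `vote_across_languages`, the use of `None` for programs that failed to run, and the tie-breaking rule (the first answer encountered wins) are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Optional


def vote_across_languages(answers: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the most common answer among the per-language results.

    `answers` maps a language name to the answer its program produced,
    or None if that program failed to compile or run (an assumption of
    this sketch). Failed runs are ignored.
    """
    valid = [answer for answer in answers.values() if answer is not None]
    if not valid:
        return None  # every program failed, so there is nothing to vote on
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties go to the answer that appears first in `answers`.
    return Counter(valid).most_common(1)[0][0]


# Hypothetical results for one question across the five languages.
results = {
    "python": "8",
    "javascript": "8",
    "java": "7",
    "cpp": None,  # e.g. the generated C++ failed to compile
    "r": "8",
}
print(vote_across_languages(results))  # prints 8
```

Python Self-Consistency can be viewed through the same function by voting over several sampled Python programs instead of one program per language; the difference lies in where the diversity of candidate answers comes from.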
### Conclusion
- **Python is Not Omnipotent**: Python performs poorly in some tasks and models, and relying on Python may lead to suboptimal solutions.
- **Importance of Multiple Languages**: Different programming languages exhibit unique advantages in different tasks and models, and a multi-language approach can significantly enhance performance.
- **Advantages of MultiPoT**: By integrating the strengths of multiple languages, MultiPoT improves average performance across tasks and models, and performs particularly well on models such as ChatGPT and Starcoder.
### Experimental Setup
- **Programming Languages**: Python, JavaScript, Java, C++, and R were selected for comparison.
- **Tasks**: Included mathematical applications, date processing, tabular data, spatial reasoning, and pure mathematical tasks.
- **Base Models**: Four large language models (LLMs) were selected: ChatGPT and three strong code generation LLMs (Starcoder, Code Llama, and Deepseek Coder).
### Experimental Results
- **Performance Differences**: Significant performance differences were observed among different programming languages across various tasks and models, with no single language being the best choice in all cases.
- **Advantages of MultiPoT**: MultiPoT significantly outperformed the Python self-consistency method in almost all tasks and models and exceeded the performance of the best single-language PoT in some tasks.
### Discussion
- **Diversity and Adaptability of Languages**: Different languages exhibit unique advantages in different tasks and models, and a multi-language approach can better meet diverse task requirements.
- **Future Research Directions**: Future work includes further analysis of how different languages perform on specific tasks and models, and exploration of additional methods for optimizing multi-language PoT.
Through this research, the authors hope to advance the development of the PoT method, making it more flexible and efficient, and applicable to a wider range of tasks and models.