Abstract: Program of Thoughts (PoT) is an approach characterized by its executable intermediate steps, which ensure the accuracy of the logical calculations in the reasoning process. Currently, PoT primarily uses Python. However, relying solely on a single language may result in suboptimal solutions and overlook the potential benefits of other programming languages. In this paper, we conduct comprehensive experiments on the programming languages used in PoT and find that no single language consistently delivers optimal performance across all tasks and models. The effectiveness of each language varies depending on the specific scenarios. Inspired by this, we propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages. Experimental results reveal that it significantly outperforms Python Self-Consistency. Furthermore, it achieves comparable or superior performance compared to the best monolingual PoT in almost all tasks across all models. In particular, MultiPoT achieves more than 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701).

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper asks whether Python is always the best choice for the Program of Thoughts (PoT) approach. Most current PoT research focuses on Python, and the authors argue that this may lead to suboptimal solutions and overlook the potential advantages of other programming languages.

### Background and Motivation

1. **Current State of the PoT Method**:
   - PoT ensures the accuracy of logical computation through executable intermediate steps. It is widely used in code generation, image reasoning, financial Q&A, and robot control.
   - PoT currently relies mainly on Python, but a single language may not be the best choice for every task and model.
2. **Advantages of Multiple Languages**:
   - Different programming languages perform differently across tasks and models.
   - A multi-language approach can draw on the strengths and diversity of several languages to improve overall performance.

### Main Contributions

1. **Experimental Validation**:
   - The authors conducted comprehensive experiments comparing programming languages across tasks and models, and found that no single language is the best choice in all cases.
   - Python performs poorly on some tasks and models, while other languages such as R and JavaScript perform better in specific scenarios.
2. **Proposing Multi-Language PoT (MultiPoT)**:
   - MultiPoT is a task- and model-agnostic method that generates PoTs in multiple languages and integrates their results through a voting mechanism.
   - MultiPoT significantly outperforms Python Self-Consistency and matches or exceeds the best single-language PoT on almost all tasks and models.

### Conclusion

- **Python Is Not Always the Best Choice**: Python performs poorly on some tasks and models, so relying on Python alone may lead to suboptimal solutions.
- **Importance of Multiple Languages**: Different programming languages have distinct advantages on different tasks and models, and a multi-language approach can significantly enhance performance.
- **Advantages of MultiPoT**: By combining the strengths of multiple languages, MultiPoT improves average performance across tasks and models, and performs particularly well on models such as ChatGPT and Starcoder.

### Experimental Setup

- **Programming Languages**: Python, JavaScript, Java, C++, and R were selected for comparison.
- **Tasks**: Mathematical applications, date processing, tabular data, spatial reasoning, and pure mathematics.
- **Base Models**: Four large language models (LLMs) were selected: ChatGPT and the three strongest code generation LLMs (Starcoder, Code Llama, and Deepseek Coder).

### Experimental Results

- **Performance Differences**: Performance differed significantly among programming languages across tasks and models, with no single language being the best choice in all cases.
- **Advantages of MultiPoT**: MultiPoT significantly outperformed Python Self-Consistency on almost all tasks and models, and exceeded the best single-language PoT on some tasks.

### Discussion

- **Diversity and Adaptability of Languages**: Different languages have distinct advantages on different tasks and models, so a multi-language approach can better meet diverse task requirements.
- **Future Research Directions**: Further analysis of how individual languages perform on specific tasks and models, and exploration of additional methods for optimizing multi-language PoT.

Through this research, the authors hope to advance the PoT method, making it more flexible and efficient and applicable to a wider range of tasks and models.
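
The voting step that MultiPoT uses to integrate results from several languages can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `multipot_vote`, the use of `None` to mark a program that failed to execute, and the tie-breaking rule (first answer in input order) are all assumptions made here for illustration. The sketch also assumes each language's program has already been generated and executed, so only the final answers are passed in.

```python
from collections import Counter
from typing import Dict, Optional


def multipot_vote(answers: Dict[str, Optional[str]]) -> Optional[str]:
    """Majority vote over per-language PoT answers (hypothetical helper).

    `answers` maps a language name to the answer produced by executing
    that language's program. None marks a failed execution (assumption).
    """
    # Discard languages whose program did not produce an answer.
    valid = [a for a in answers.values() if a is not None]
    if not valid:
        return None

    counts = Counter(valid)
    top = max(counts.values())

    # Tie-break (assumption): return the first most-frequent answer
    # in the order the languages were supplied.
    for answer in valid:
        if counts[answer] == top:
            return answer
    return None


# Illustrative, made-up answers for one problem across five languages.
example = {
    "python": "42",
    "r": "42",
    "javascript": "41",
    "java": "42",
    "cpp": None,  # program failed to run
}
print(multipot_vote(example))
```

Python Self-Consistency, the baseline the summary compares against, would apply the same kind of vote to several Python samples rather than to one program per language.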

Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts
