To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett

2024-10-29

Abstract: Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper asks for which types of tasks chain-of-thought (CoT) prompting is truly beneficial. The authors combine a quantitative meta-analysis of more than 100 papers that use CoT with their own evaluations of 14 large language models (LLMs) on 20 datasets. They find that CoT yields strong performance gains mainly on tasks involving mathematics or logic, with much smaller gains on other types of tasks. The paper also analyzes how CoT behaves on mathematical and symbolic reasoning tasks by separating each problem into two phases, planning and execution, and by comparing CoT against tool-augmented LLMs. Much of CoT's gain comes from the execution phase (i.e., performing calculations and symbolic operations), yet CoT still underperforms a symbolic solver on that phase. These results indicate that CoT can be applied selectively, maintaining performance while saving inference costs, and that new paradigms beyond prompt-based CoT are needed to make better use of intermediate computation across the whole range of LLM applications.
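As an illustration of what applying CoT selectively could look like, the sketch below routes a question to a CoT prompt only when it contains an equals sign, echoing the MMLU observation in the abstract. This is a minimal sketch under stated assumptions, not the authors' method: the function names `needs_cot` and `build_prompt`, the prompt wording, and the choice to inspect only the question (the paper's observation also covers the model's response) are all hypothetical.

```python
def needs_cot(question: str, symbolic_markers: tuple = ("=",)) -> bool:
    """Hypothetical heuristic: treat a question as symbolic if it contains a marker.

    The default marker "=" echoes the abstract's MMLU observation; checking
    only the question (not the model's response) is an assumption of this sketch.
    """
    return any(marker in question for marker in symbolic_markers)


def build_prompt(question: str) -> str:
    """Append a CoT cue for symbolic questions, a direct-answer cue otherwise."""
    if needs_cot(question):
        # Assumed CoT cue; any step-by-step instruction could be substituted.
        return f"{question}\nLet's think step by step."
    # Assumed direct-answer cue, which skips the longer reasoning output.
    return f"{question}\nAnswer directly with the final answer only."


if __name__ == "__main__":
    print(build_prompt("Solve 2x + 3 = 7 for x."))
    print(build_prompt("Which city is the capital of France?"))
```

A router like this trades a small risk of misclassifying a question for fewer generated tokens on questions where, per the summary above, CoT gives much smaller gains.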

Chain of Thoughtlessness? An Analysis of CoT in Planning

Towards understanding chain-of-thought prompting: An empirical study of what matters

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

Markov Chain of Thought for Efficient Mathematical Reasoning

Supervised Chain of Thought

Design of Chain-of-Thought in Math Problem Solving

How Likely Do LLMs with CoT Mimic Human Reasoning?

CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning

Automatic prompt augmentation and selection with chain-of-thought from labeled data

Towards revealing the mystery behind chain of thought: a theoretical perspective

Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Chain-of-Thought Reasoning Without Prompting

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models

Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

A comparison of chain-of-thought reasoning strategies across datasets and models