Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy

2024-10-04

Abstract: Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning, we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers, where letters are shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs -- GPT-4, Claude 3, and Llama 3.1 -- performing this task using CoT prompting. By focusing on a single relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task's expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output's probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning. Code and data are available at https://github.com/aksh555/deciphering_cot

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper examines how large language models (LLMs) reason when given Chain-of-Thought (CoT) prompts. Specifically, the authors use the task of decoding shift ciphers to identify the factors that affect CoT performance:

1. **Probability**: How probable the correct output is. The paper finds that LLMs perform better with CoT when the correct answer is a high-probability output.
2. **Memorization**: What the model has learned during pre-training. For example, some specific shift ciphers (such as rot-13) are more common on the Internet, so the model performs better on them.
3. **Noisy Reasoning**: Errors introduced during the reasoning process. The paper finds that accuracy declines as the task requires more intermediate operations.

By analyzing these factors, the authors aim to explain how CoT prompting improves the reasoning ability of LLMs, and whether the improvement reflects abstract reasoning or simple memorization. The paper takes an intermediate view: LLM behavior under CoT prompting contains both a memorization component and a probabilistic reasoning component.
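To make the task concrete, here is a minimal sketch of a shift cipher in Python. It is not taken from the paper's repository; the function names (`shift_encode`, `shift_decode`, `toy_accuracy`) are hypothetical. The last function is an illustrative assumption only: it treats every intermediate operation as failing independently with the same probability, which is one simple way to picture why accuracy could fall as more steps are required. It is not the paper's fitted model.

```python
def _shift_char(ch: str, shift: int) -> str:
    """Shift one ASCII letter by `shift` positions, wrapping around the alphabet."""
    if "a" <= ch <= "z":
        base = ord("a")
    elif "A" <= ch <= "Z":
        base = ord("A")
    else:
        return ch  # leave spaces, digits, and punctuation unchanged
    return chr((ord(ch) - base + shift) % 26 + base)


def shift_encode(text: str, shift: int) -> str:
    """Encode by moving every letter `shift` steps forward (rot-13 is shift=13)."""
    return "".join(_shift_char(ch, shift) for ch in text)


def shift_decode(text: str, shift: int) -> str:
    """Decode by moving every letter `shift` steps backward."""
    return shift_encode(text, -shift)


def toy_accuracy(per_step_error: float, steps: int) -> float:
    """Assumed toy model: each step fails independently with the same probability."""
    return (1.0 - per_step_error) ** steps
```

For example, `shift_encode("stay", 13)` gives `"fgnl"`, and `shift_decode("fgnl", 13)` recovers `"stay"`. Under the toy model, a larger `steps` value lowers the chance that every intermediate operation is carried out correctly.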


Chain-of-Thought Reasoning Without Prompting

Stress Testing Chain-of-Thought Prompting for Large Language Models

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

When Do Program-of-Thought Works for Reasoning?

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Chain of Thoughtlessness? An Analysis of CoT in Planning

A comparison of chain-of-thought reasoning strategies across datasets and models

R$^3$ Prompting: Review, Rephrase and Resolve for Chain-of-Thought Reasoning in Large Language Models under Noisy Context

Supervised Chain of Thought

Unveiling the Statistical Foundations of Chain-of-Thought Prompting Methods

Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding

An automatically discovered chain-of-thought prompt generalizes to novel models and datasets

An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs

How Likely Do LLMs with CoT Mimic Human Reasoning?

ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data

Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation

Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models