Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy
2024-10-04
Abstract: Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers, where letters are shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs -- GPT-4, Claude 3, and Llama 3.1 -- performing this task using CoT prompting. By focusing on a single relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task's expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output's probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning. Code and data are available at https://github.com/aksh555/deciphering_cot
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines how large language models (LLMs) reason when given Chain-of-Thought (CoT) prompts. Using the task of decoding shift ciphers as a case study, the authors identify three factors that affect CoT performance:

1. **Probability**: The probability of the task's expected output. CoT accuracy is higher when the correct answer is a high-probability output; with GPT-4, varying the output's probability shifts accuracy from 26% to 70%.
2. **Memorization**: What the model has implicitly learned during pre-training. For example, some shift ciphers (such as rot-13) are more common on the Internet, so the model performs better on them.
3. **Noisy reasoning**: Errors introduced during the reasoning process. As the number of intermediate operations a task requires increases, the model's accuracy declines.

By analyzing these factors, the authors aim to clarify how CoT prompting improves LLM reasoning and whether that improvement reflects abstract reasoning or simple memorization. They take an intermediate view: CoT performance reflects both memorization and a probabilistic version of genuine reasoning.
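To make the task concrete, here is a minimal Python sketch of shift-cipher encoding and decoding. It is an illustration only, not the authors' code; the function names `shift_decode` and `shift_encode` are hypothetical, and the handling of uppercase letters and punctuation is an assumption rather than something the abstract specifies.

```python
def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a shift cipher by moving each letter `shift` steps backward.

    Assumption: only ASCII letters are shifted; case is preserved and
    all other characters (spaces, punctuation) pass through unchanged.
    """
    decoded = []
    for ch in ciphertext:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            # Wrap around the 26-letter alphabet with modular arithmetic.
            decoded.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            decoded.append(ch)
    return "".join(decoded)


def shift_encode(plaintext: str, shift: int) -> str:
    """Encode by shifting forward, i.e. decoding with a negative shift."""
    return shift_decode(plaintext, -shift)


if __name__ == "__main__":
    secret = shift_encode("stay", 13)   # rot-13 is the shift of 13
    print(secret, "->", shift_decode(secret, 13))
```

A program applies the shift in a single arithmetic step, so its accuracy does not depend on the shift size. A model that decodes by stepping through the alphabet one letter at a time performs more intermediate operations for larger shifts, which is the setting in which the noisy-reasoning factor is described.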