Abstract: Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models (LLMs) on various question answering tasks. While understanding why CoT prompting is effective is crucial to ensuring that this phenomenon is a consequence of desired model behavior, little work has addressed this; nonetheless, such an understanding is a critical prerequisite for responsible model deployment. We address this question by leveraging gradient-based feature attribution methods which produce saliency scores that capture the influence of input tokens on model output. Specifically, we probe several open-source LLMs to investigate whether CoT prompting affects the relative importances they assign to particular input tokens. Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt compared to standard few-shot prompting, it increases the robustness of saliency scores to question perturbations and variations in model output.

What problem does this paper attempt to address?

This paper seeks to explain why chain-of-thought (CoT) prompting improves the accuracy of large language models (LLMs) on a range of question-answering tasks. Although CoT prompting has been shown empirically to improve performance, its underlying mechanism remains poorly understood. Understanding how CoT prompting works is needed both to confirm that the gains come from the desired model behavior and to deploy these models responsibly. The authors approach the problem in two ways:

1. **Use gradient-based feature attribution methods**: These methods produce saliency scores that capture the influence of each input token on the model output.
2. **Examine how CoT prompting changes the importance the model assigns to specific input tokens**: By comparing the saliency scores of input tokens under different prompting methods, the authors study whether and how CoT prompting affects which parts of the input the model relies on.

### Research Background and Motivation

Large language models, such as those based on the Transformer architecture, have developed rapidly and attracted great interest from both researchers and the public. Because the internal mechanisms of these models are opaque, understanding and interpreting their behavior is especially important. For newer strategies such as CoT prompting, understanding how they work is crucial to ensuring that the model behaves as expected, safely and reliably.

### Main Research Questions

1. **Does CoT prompting increase the saliency scores of semantically relevant tokens?**
   - The authors hypothesize that CoT prompting makes the model attribute more importance to the relevant input tokens, even as the input length increases.
2. **Does CoT prompting make model behavior more robust to question restatements?**
   - The authors hypothesize that, with CoT prompting, saliency scores change less when the question is reworded; that is, the model focuses on the relevant tokens more consistently.
3. **Does CoT prompting make model gradients more stable across randomly generated outputs?**
   - The authors hypothesize that CoT prompting reduces the variation of saliency scores between different outputs, making the model more robust to the randomness of text generation.

### Experimental Design

The authors used open-source models such as GPT-J (6 billion parameters) and ran experiments on multiple question-answering datasets. Comparing saliency scores under standard prompts and CoT prompts, they reached the following conclusions:

- CoT prompting does not significantly increase the saliency scores of semantically relevant tokens, although it improves the model's accuracy on some datasets.
- CoT prompting makes the model more robust to question restatements: saliency scores change less when the question is reworded.
- CoT prompting makes model gradients more stable across different outputs: the variance of the saliency scores decreases.

### Conclusion

Although CoT prompting does not significantly improve accuracy on smaller-scale models, it does change how the model weighs its input tokens, making the attributions more stable and consistent. This suggests that CoT prompting may improve performance by changing how the model processes its input, not just by generating more plausible explanations. In summary, the study offers a new perspective on how CoT prompting works and lays a foundation for further exploration of the behavior of large language models.
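
The two quantities the summary relies on, per-token saliency scores and their variance across outputs, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the summary does not say which attribution method the authors used, so gradient-times-input stands in for it; `score` is a hypothetical toy function rather than an LLM; gradients are approximated by central finite differences instead of backpropagation; and different "outputs" are modeled as different weight vectors.

```python
import math
from statistics import pvariance


def score(embeddings, weights):
    """Hypothetical stand-in for a model logit: tanh of a weighted sum
    of the mean-pooled token embeddings."""
    dim = len(weights)
    pooled = [sum(tok[d] for tok in embeddings) / len(embeddings) for d in range(dim)]
    return math.tanh(sum(w * p for w, p in zip(weights, pooled)))


def grad_times_input(embeddings, weights, eps=1e-5):
    """Per-token saliency: sum over dimensions of d(score)/d(e[i][d]) * e[i][d].
    The gradient is approximated with central finite differences."""
    saliency = []
    for i, tok in enumerate(embeddings):
        total = 0.0
        for d in range(len(tok)):
            plus = [list(t) for t in embeddings]
            minus = [list(t) for t in embeddings]
            plus[i][d] += eps
            minus[i][d] -= eps
            grad = (score(plus, weights) - score(minus, weights)) / (2 * eps)
            total += grad * tok[d]
        saliency.append(total)
    return saliency


def saliency_variance(saliency_runs):
    """Population variance of each token's saliency across runs,
    for example across different sampled outputs."""
    return [pvariance(per_token) for per_token in zip(*saliency_runs)]


if __name__ == "__main__":
    # Three toy "tokens" with two-dimensional embeddings (made-up values).
    tokens = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
    # Two hypothetical outputs, each represented by its own weight vector.
    runs = [grad_times_input(tokens, w) for w in ([0.3, -0.2], [0.1, 0.4])]
    print("saliency per output:", runs)
    print("variance per token: ", saliency_variance(runs))
```

In this framing, a lower per-token variance under one prompting style than another would correspond to the "more stable gradients" finding described above. With a real model, the finite-difference loop would be replaced by a backward pass through the network with respect to the input embeddings.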

Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions
