Abstract: In-Context Learning (ICL) in Large Language Models (LLMs) has emerged as the dominant technique for performing natural language tasks, as it does not require updating the model parameters with gradient-based methods. ICL promises to "adapt" the LLM to perform the present task at a competitive or state-of-the-art level at a fraction of the computational cost. ICL can be augmented by incorporating the reasoning process to arrive at the final label explicitly in the prompt, a technique called Chain-of-Thought (CoT) prompting. However, recent work has found that ICL relies mostly on the retrieval of task priors and less so on "learning" to perform tasks, especially for complex subjective domains like emotion and morality, where priors ossify posterior predictions. In this work, we examine whether "enabling" reasoning also creates the same behavior in LLMs, wherein the format of CoT retrieves reasoning priors that remain relatively unchanged despite the evidence in the prompt. We find that, surprisingly, CoT indeed suffers from the same posterior collapse as ICL for larger language models. Code is available at https://github.com/gchochla/cot-priors.
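To make the difference between plain ICL and CoT prompting concrete, the sketch below builds both prompt variants from the same demonstrations. It is a minimal illustration only: the `Demo` class, the `build_prompt` helper, the `Input:`/`Reasoning:`/`Label:` field names, and the example texts are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One in-context demonstration (hypothetical structure, for illustration)."""
    text: str       # input to classify
    label: str      # label shown in the prompt
    rationale: str  # reasoning shown only in the CoT variant


def build_prompt(demos, query, use_cot=False):
    """Format the demonstrations followed by the query.

    With use_cot=True, each demonstration also shows its rationale
    before the label, which is what distinguishes CoT from plain ICL.
    """
    blocks = []
    for d in demos:
        lines = [f"Input: {d.text}"]
        if use_cot:
            lines.append(f"Reasoning: {d.rationale}")
        lines.append(f"Label: {d.label}")
        blocks.append("\n".join(lines))
    # The prompt ends at the first field the model is expected to generate.
    tail = "Reasoning:" if use_cot else "Label:"
    blocks.append(f"Input: {query}\n{tail}")
    return "\n\n".join(blocks)


# Made-up demonstrations for an emotion-classification task.
demos = [
    Demo("I finally got the job!", "joy",
         "The speaker celebrates a long-awaited success."),
    Demo("They cancelled my flight again.", "anger",
         "The speaker is frustrated by a repeated problem."),
]

icl_prompt = build_prompt(demos, "Great, another Monday.")
cot_prompt = build_prompt(demos, "Great, another Monday.", use_cot=True)
```

The two prompts contain identical inputs and labels; only the presence of the rationale lines differs, so any change in model behavior between them can be attributed to the reasoning shown in the prompt.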

What problem does this paper attempt to address?

The paper asks whether Chain-of-Thought (CoT) prompting can overcome the strong influence of prior knowledge on posterior predictions in large language models (LLMs), especially on complex subjective tasks such as emotion and moral judgment. Specifically, the authors explore the following points:

1. **Performance comparison between CoT and traditional in-context learning (ICL)**: Whether CoT outperforms traditional ICL under multi-shot conditions.
2. **The prior knowledge problem of CoT**: Whether CoT, like ICL, depends on the model's prior knowledge and ignores the actual evidence in the prompt.
3. **The reasonableness of the generated reasoning chains**: Whether the reasoning chains generated by LLMs are reasonable and coherent, and whether the labels can be derived directly from them.

### Main findings

- **Limited performance improvement**: For complex subjective tasks, CoT does not significantly improve the performance of LLMs, especially for larger models. Smaller models may benefit more from it.
- **The influence of prior knowledge**: Even with CoT, larger LLMs still rely on their internal prior knowledge rather than adjusting to the reasoning chains in the prompt.
- **Generated reasoning chains**: Although the generated reasoning chains are usually reasonable and coherent, they often overlook subtle meanings in the input, such as sarcasm.

### Conclusion

The paper concludes that, on complex subjective tasks, large language models have difficulty overcoming the strong influence of prior knowledge on posterior predictions even when using CoT. Although CoT can improve the performance of some smaller models, in larger models it does not effectively improve understanding or handling of complex subjective tasks.
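One simple way to probe the reliance on priors described above is to compare how often the model's predictions with demonstrations (the "posterior") match its own predictions without demonstrations (the "prior"), versus how often they match the reference labels. The sketch below is an illustrative assumption, not necessarily the measure used in the paper; the `agreement` helper and all label lists are made up.

```python
def agreement(a, b):
    """Fraction of positions at which two equal-length label lists match."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy predictions for four inputs (invented for illustration).
prior_preds = ["joy", "anger", "joy", "sadness"]    # no demonstrations
posterior_preds = ["joy", "anger", "joy", "joy"]    # with ICL or CoT demonstrations
gold_labels = ["sadness", "anger", "anger", "joy"]  # reference labels

to_prior = agreement(posterior_preds, prior_preds)  # 3 of 4 match
to_gold = agreement(posterior_preds, gold_labels)   # 2 of 4 match

# If predictions track the prior more closely than the reference labels,
# the prompt evidence has moved the model less than one would hope.
print(f"agreement with prior: {to_prior:.2f}")
print(f"agreement with gold:  {to_gold:.2f}")
```

In this toy case the posterior predictions agree with the prior on 0.75 of the inputs but with the reference labels on only 0.50, which is the pattern the summary attributes to larger models.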

Larger Language Models Don't Care How You Think: Why Chain-of-Thought Prompting Fails in Subjective Tasks

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

Supervised Chain of Thought

Large Language Models Might Not Care What You Are Saying: Prompt Format Beats Descriptions

Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation

Stress Testing Chain-of-Thought Prompting for Large Language Models

Chain-of-Thought Reasoning Without Prompting

ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting

Why Can Large Language Models Generate Correct Chain-of-Thoughts?

Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions

An automatically discovered chain-of-thought prompt generalizes to novel models and datasets

Think Beyond Size: Adaptive Prompting for More Effective Reasoning

Large Language Models are Contrastive Reasoners

Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models