Larger Language Models Don't Care How You Think: Why Chain-of-Thought Prompting Fails in Subjective Tasks

Georgios Chochlakis, Niyantha Maruthu Pandiyan, Kristina Lerman, Shrikanth Narayanan
2024-10-18
Abstract: In-Context Learning (ICL) in Large Language Models (LLMs) has emerged as the dominant technique for performing natural language tasks, as it does not require updating the model parameters with gradient-based methods. ICL promises to "adapt" the LLM to perform the present task at a competitive or state-of-the-art level at a fraction of the computational cost. ICL can be augmented by incorporating the reasoning process to arrive at the final label explicitly in the prompt, a technique called Chain-of-Thought (CoT) prompting. However, recent work has found that ICL relies mostly on the retrieval of task priors and less so on "learning" to perform tasks, especially for complex subjective domains like emotion and morality, where priors ossify posterior predictions. In this work, we examine whether "enabling" reasoning also creates the same behavior in LLMs, wherein the format of CoT retrieves reasoning priors that remain relatively unchanged despite the evidence in the prompt. We find that, surprisingly, CoT indeed suffers from the same posterior collapse as ICL for larger language models. Code is available at https://github.com/gchochla/cot-priors.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks whether Chain-of-Thought (CoT) prompting can overcome the strong influence of prior knowledge on posterior predictions in large language models (LLMs), especially for complex subjective tasks such as emotion and moral judgment. Specifically, the authors explore the following points:

1. **Performance comparison between CoT and traditional in-context learning (ICL)**: Whether CoT performs better than traditional ICL under multi-shot conditions.
2. **The prior knowledge problem of CoT**: Whether CoT, like ICL, depends on the model's prior knowledge and ignores the actual evidence in the prompt.
3. **The rationality of the generated reasoning chains**: Whether the reasoning chains generated by LLMs are reasonable and coherent, and whether the labels can be derived directly from them.

### Main findings

- **Limited performance improvement**: For complex subjective tasks, CoT does not significantly improve the performance of LLMs, especially for larger models. Smaller models may benefit more from it.
- **The influence of prior knowledge**: Even with CoT, larger LLMs still rely on their internal prior knowledge rather than adjusting to the reasoning chains in the prompt.
- **Generated reasoning chains**: Although the generated reasoning chains are usually reasonable and coherent, they often overlook subtle meanings in the input, such as sarcasm.

### Conclusion

The paper concludes that, for complex subjective tasks, large language models have difficulty overcoming the strong influence of prior knowledge on posterior predictions even when using CoT. Although CoT can improve the performance of some smaller models, in larger models it does not effectively improve understanding or handling of complex subjective tasks.
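
The difference between plain ICL and CoT prompting described above comes down to what each demonstration in the prompt contains: ICL shows input-label pairs, while CoT also writes out the reasoning that leads to the label. The following is a minimal sketch of that difference in Python. The `Demo` class, the `Input:` / `Reasoning:` / `Label:` field names, and the example texts are illustrative assumptions, not the prompt templates or data used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    text: str
    reasoning: str
    label: str


def build_icl_prompt(demos, query):
    """Plain ICL: each demonstration is an input paired with its label."""
    blocks = [f"Input: {d.text}\nLabel: {d.label}" for d in demos]
    # The query is left open at the label for the model to complete.
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


def build_cot_prompt(demos, query):
    """CoT: each demonstration also spells out reasoning before the label."""
    blocks = [
        f"Input: {d.text}\nReasoning: {d.reasoning}\nLabel: {d.label}"
        for d in demos
    ]
    # The query is left open at the reasoning step, so the model is expected
    # to generate its own reasoning chain and then a label.
    blocks.append(f"Input: {query}\nReasoning:")
    return "\n\n".join(blocks)


# Made-up demonstrations for an emotion-labeling task.
demos = [
    Demo(
        text="I can't believe they cancelled the show again.",
        reasoning="The speaker expresses frustration at a repeated letdown.",
        label="anger",
    ),
    Demo(
        text="We finally got the keys to our new place!",
        reasoning="The speaker celebrates a long-awaited positive event.",
        label="joy",
    ),
]
query = "Oh great, another Monday."

icl_prompt = build_icl_prompt(demos, query)
cot_prompt = build_cot_prompt(demos, query)
print(cot_prompt)
```

Under this framing, the paper's question is whether the extra `Reasoning:` lines in the demonstrations actually change what a larger model predicts, or whether its predictions stay close to its priors regardless of them. A sarcastic query like the one above is the kind of input where, according to the summary, generated reasoning chains tend to miss the intended meaning.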