Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez

2023-07-17

Abstract: Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.

Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper asks whether the step-by-step "Chain-of-Thought" (CoT) reasoning that large language models (LLMs) produce before answering faithfully reflects the process the model actually uses to reach its answer. The authors examine several ways CoT could be unfaithful: post-hoc reasoning (reasoning produced after the conclusion has effectively been determined), performance gains that come from the extra test-time compute rather than from the stated reasoning, and information encoded in the particular phrasing of the CoT. They intervene on the CoT (e.g., by adding mistakes or paraphrasing it) and observe how the model's final answer changes. Reliance on the CoT varies widely across tasks; the performance boost does not appear to come from added test-time compute alone or from phrasing; and on most tasks studied, larger and more capable models produce less faithful reasoning than smaller ones. The research matters for understanding how LLMs work internally and how far their stated reasoning can be trusted in applications that require explainability.
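
The intervention idea described above can be illustrated with a small, hypothetical sketch: perturb a CoT, ask the model for its answer again, and count how often the answer changes. Everything named here (`answer_change_rate`, `add_mistake`, and the two toy "models") is an assumption made for illustration and is not the authors' code, prompts, or exact metric. A real study would replace the toy answer functions with calls to an actual LLM.

```python
import re
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]             # (question, chain_of_thought)
AnswerFn = Callable[[str, str], str]  # (question, cot) -> final answer
InterveneFn = Callable[[str], str]    # cot -> perturbed cot


def answer_change_rate(
    examples: Sequence[Example],
    answer_fn: AnswerFn,
    intervene_fn: InterveneFn,
) -> float:
    """Fraction of examples whose final answer changes after the CoT is perturbed.

    A high rate suggests the answer is conditioned on the CoT; a low rate
    suggests the CoT is largely ignored.
    """
    if not examples:
        raise ValueError("examples must be non-empty")
    changed = 0
    for question, cot in examples:
        original = answer_fn(question, cot)
        perturbed = answer_fn(question, intervene_fn(cot))
        if original != perturbed:
            changed += 1
    return changed / len(examples)


def add_mistake(cot: str) -> str:
    """Toy intervention (assumption): add 1 to the last integer in the CoT."""
    matches = list(re.finditer(r"\d+", cot))
    if not matches:
        return cot
    last = matches[-1]
    return cot[: last.start()] + str(int(last.group()) + 1) + cot[last.end():]


def cot_reader(question: str, cot: str) -> str:
    """Toy stand-in model that answers with the last integer in the CoT."""
    numbers = re.findall(r"\d+", cot)
    return numbers[-1] if numbers else ""


def cot_ignorer(question: str, cot: str) -> str:
    """Toy stand-in model that ignores the CoT and always gives the same answer."""
    return "5"


EXAMPLES = [
    ("What is 2 + 3?", "2 plus 3 equals 5."),
    ("What is 4 * 6?", "4 times 6 equals 24."),
]

if __name__ == "__main__":
    print(answer_change_rate(EXAMPLES, cot_reader, add_mistake))   # 1.0
    print(answer_change_rate(EXAMPLES, cot_ignorer, add_mistake))  # 0.0
```

The two toy models mark the extremes: one whose answer depends entirely on the CoT and one that disregards it. The abstract reports that real models fall between these extremes, with the degree of reliance varying by task and model size.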

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning.

Towards Faithful Chain-of-Thought: Large Language Models are Bridging Reasoners

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning

Dissociation of Faithful and Unfaithful Reasoning in LLMs

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Chain-of-Thought Unfaithfulness as Disguised Accuracy

On the Impact of Fine-Tuning on Chain-of-Thought Reasoning

An electronic blood-cell counting machine.

Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

How Likely Do LLMs with CoT Mimic Human Reasoning?

The Impact of Reasoning Step Length on Large Language Models

Calibrating Reasoning in Language Models with Internal Consistency

Multimodal Chain-of-Thought Reasoning in Language Models

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Understanding Chain-of-Thought in LLMs through Information Theory

RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought

Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models