Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez
2023-07-17
Abstract: Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper asks whether the step-by-step "Chain-of-Thought" (CoT) reasoning that large language models (LLMs) produce before answering genuinely reflects the process by which the model arrives at its answer. Specifically, the authors examine three ways in which CoT could be unfaithful: post-hoc reasoning (reasoning produced after the conclusion has effectively been determined), performance gains that come merely from the additional test-time compute a CoT provides rather than from its content, and information encoded in the particular phrasing of the CoT. To evaluate faithfulness, they intervene on the CoT (for example, by adding mistakes or paraphrasing it) and observe how these interventions change the model's final answers. The study finds large variation in CoT faithfulness across tasks and model sizes; on most of the tasks studied, larger and more capable models produce less faithful reasoning than smaller ones. CoT's performance boost does not appear to come from added test-time compute alone or from the particular phrasing of the CoT. This research matters for understanding the internal workings of LLMs and their reliability in applications that require explainability.
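The intervention-based evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `answer_fn`, `intervene`, the two toy "models", and the truncation intervention are all hypothetical stand-ins chosen for illustration. The idea is to compare the answer given with the original CoT against the answer given with a modified CoT, and to report how often the answer stays the same.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a "model" maps (question, CoT steps) to a final answer,
# and an intervention maps CoT steps to modified CoT steps.
AnswerFn = Callable[[str, List[str]], str]
Intervention = Callable[[List[str]], List[str]]


def fraction_same_answer(
    answer_fn: AnswerFn,
    examples: Sequence[Tuple[str, List[str]]],
    intervene: Intervention,
) -> float:
    """Fraction of examples whose answer is unchanged after intervening on the CoT."""
    if not examples:
        raise ValueError("examples must be non-empty")
    same = 0
    for question, steps in examples:
        original = answer_fn(question, list(steps))
        modified = answer_fn(question, intervene(list(steps)))
        same += int(original == modified)
    return same / len(examples)


def keep_first_step(steps: List[str]) -> List[str]:
    """Illustrative intervention (an assumption): drop all but the first step."""
    return steps[:1]


def cot_ignoring_model(question: str, steps: List[str]) -> str:
    """Toy model whose answer never depends on the CoT."""
    return "A"


def cot_dependent_model(question: str, steps: List[str]) -> str:
    """Toy model that simply reports the last CoT step as its answer."""
    return steps[-1] if steps else "unknown"


examples = [
    ("q1", ["step 1", "step 2", "A"]),
    ("q2", ["step 1", "B"]),
]

ignoring_score = fraction_same_answer(cot_ignoring_model, examples, keep_first_step)
dependent_score = fraction_same_answer(cot_dependent_model, examples, keep_first_step)
print(ignoring_score, dependent_score)
```

Under this sketch, a score near 1 means the answer rarely changes when the CoT is altered, which is consistent with the model largely ignoring its stated reasoning. A score near 0 means the model conditions strongly on the CoT. In practice `answer_fn` would wrap a real language model call, and `intervene` could insert a mistake or paraphrase the reasoning instead of truncating it.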