Abstract: Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the 'information gain' at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to formalize and evaluate the Chain-of-Thought (CoT) reasoning process in large language models (LLMs) using information theory. Specifically, it addresses the following key issues:

1. **Limitations of existing CoT evaluation methods**:
   - **Requirement for annotated data**: Existing CoT evaluation techniques usually rely on manually annotated CoT data, which is expensive and time-consuming to produce.
   - **Inability to accurately evaluate intermediate steps**: Existing methods assess intermediate reasoning steps poorly, resulting in a high false-positive rate.
2. **Lack of in-depth understanding of intermediate reasoning steps**:
   - Existing methods mainly focus on the correctness of the final output and do not assess the quality of each reasoning step, making it difficult to identify failure modes in the model's reasoning process.

### New methods proposed in the paper

To solve these problems, the paper proposes an information-theoretic framework that quantifies the information gain of each reasoning step, thereby identifying failure modes in an LLM's reasoning process without the need for expensive annotated data. The specific contributions are as follows:

1. **Development of a new framework**:
   - The framework describes and detects failure modes in the CoT reasoning process of LLMs, providing a rigorous language for analyzing the quality of each reasoning step.
2. **Proposal of a practical algorithm**:
   - The algorithm evaluates the model's performance on individual sub-tasks without relying on annotated intermediate reasoning steps, providing more fine-grained information about CoT performance.
3. **Verification of the effectiveness of the method**:
   - Extensive experiments on toy datasets and the GSM-8K dataset show that the method outperforms existing outcome-based evaluation methods (such as outcome reward modeling and Math-Shepherd) in identifying failure modes in CoT reasoning. Those baseline methods rely on final accuracy and are prone to a higher false-positive rate in error detection.

### Main conclusions

By introducing the concept of information gain, the paper provides a method for evaluating the CoT reasoning process in LLMs without the need for annotated data. The method identifies errors in the reasoning process more accurately and also provides insights for improving the model's reasoning ability.
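The idea of measuring information gain per reasoning step can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a hypothetical supervisor model that, after seeing the first j reasoning steps, assigns a probability to the correct final answer, and it takes the gain of step j to be the change in average log-likelihood between step j-1 and step j. The function names, the threshold, and the probabilities below are all made up for illustration.

```python
import math
from typing import List, Sequence


def mean_log_likelihood(probs: Sequence[float]) -> float:
    """Average log-probability assigned to the correct final answer.

    Probabilities are clipped away from zero so that log() is defined.
    """
    return sum(math.log(max(p, 1e-12)) for p in probs) / len(probs)


def information_gain_per_step(step_probs: List[Sequence[float]]) -> List[float]:
    """Estimate the gain contributed by each reasoning step.

    step_probs[j][i] is the probability that the (hypothetical) supervisor
    assigns to the correct final answer of sample i after seeing steps 1..j;
    step_probs[0] corresponds to the prompt alone. The returned list holds
    one gain per step j >= 1.
    """
    lls = [mean_log_likelihood(p) for p in step_probs]
    return [lls[j] - lls[j - 1] for j in range(1, len(lls))]


def flag_uninformative_steps(gains: Sequence[float], threshold: float = 0.05) -> List[int]:
    """Return the 1-based indices of steps whose gain is at or below the threshold."""
    return [j + 1 for j, g in enumerate(gains) if g <= threshold]


# Illustrative numbers only: four samples, three reasoning steps.
step_probs = [
    [0.25, 0.25, 0.25, 0.25],  # prompt only
    [0.50, 0.50, 0.50, 0.50],  # after step 1
    [0.50, 0.50, 0.50, 0.50],  # after step 2: the answer is no easier to predict
    [1.00, 1.00, 1.00, 1.00],  # after step 3
]

gains = information_gain_per_step(step_probs)
flagged = flag_uninformative_steps(gains)
print(gains)
print(flagged)  # [2]
```

In this toy setting, step 2 is flagged because it does not make the correct answer any more predictable, even though the chain as a whole ends at the right answer. An outcome-only check that looks at final accuracy cannot, by construction, point to that step.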

Understanding Chain-of-Thought in LLMs through Information Theory
