Chain-of-Probe: Examing the Necessity and Accuracy of CoT Step-by-Step

Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong

2024-06-23

Abstract: Current research found the issue of Early Answering in large language models (LLMs), where the models already have an answer before generating the Chain-of-Thought (CoT). This phenomenon suggests a potential lack of necessary dependency between the predicted answer and the reasoning process. Consequently, two important questions arise: (1) Is CoT still necessary if the model already has an answer? (2) Can the correctness of the answer serve as valid evidence for the correctness of CoT? To address these questions, we propose a method, namely Chain-of-Probe (CoP), to probe changes in the mind during the model's reasoning. The probing results show that in a significant number of question-answer cases, CoT appears to be unnecessary, and this necessity correlates with the simplicity of the task, defined by reasoning steps required. Furthermore, by analyzing patterns in mind change, we examine the correctness of the model's reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. To this end, we propose a strategic approach based on CoP to prioritize answers with correct reasoning among multiple candidates, thereby bolstering the reliability of the model's reasoning.

Computation and Language

What problem does this paper attempt to address?

This paper addresses "early answering" in large language models (LLMs): the model has already settled on an answer before it generates the Chain-of-Thought (CoT). When this happens, the CoT may contribute little or nothing to the final prediction, so a correct final answer is not reliable evidence that the reasoning behind it is correct. The paper raises two core questions:

1. If the model already has the answer before generating CoT, is CoT still necessary?
2. Can the correctness of the final answer serve as valid evidence for the correctness of CoT?

To answer them, the authors propose Chain-of-Probe (CoP), a method that probes how the model's preferred answer changes during the reasoning process. The CoP results indicate that CoT is often unnecessary for simple tasks. For complex tasks, CoT may change the model's initial choice, but the change is not always for the better. The authors also find that a response can reach the correct final answer while containing errors in its reasoning.

Based on these observations, the paper proposes using CoP scores to prioritize, among multiple candidate answers, those whose reasoning process is correct, thereby improving the reliability of the model's reasoning. According to the reported experiments, this strategy also improves the model's overall accuracy.
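To make the idea of tracking "changes of mind" concrete, here is a minimal sketch rather than the authors' implementation. It assumes a hypothetical input format: a list of per-step confidence dictionaries over answer options, where the first entry is probed before any reasoning step is generated. The function names and the rule "no change of preferred answer means early answering" are illustrative assumptions, not definitions taken from the paper.

```python
from typing import Dict, List


def predicted_choice(confidences: Dict[str, float]) -> str:
    """Return the answer option with the highest confidence."""
    return max(confidences, key=confidences.get)


def summarize_probe_trace(trace: List[Dict[str, float]]) -> dict:
    """Summarize how the preferred answer evolves across reasoning steps.

    Assumption (hypothetical format): trace[0] is the confidence probed
    before any reasoning step, and trace[i] is the confidence after step i.
    """
    if not trace:
        raise ValueError("trace must contain at least one probe")
    choices = [predicted_choice(c) for c in trace]
    # Count adjacent probes whose preferred answer differs.
    changes = sum(1 for prev, cur in zip(choices, choices[1:]) if prev != cur)
    return {
        "choices": choices,
        "mind_changes": changes,
        # Illustrative rule: the initial answer never changed.
        "early_answering": changes == 0,
    }


# Toy traces with made-up numbers, for illustration only.
stable = [{"A": 0.7, "B": 0.3}, {"A": 0.8, "B": 0.2}, {"A": 0.9, "B": 0.1}]
revised = [{"A": 0.6, "B": 0.4}, {"A": 0.45, "B": 0.55}, {"A": 0.2, "B": 0.8}]

stable_summary = summarize_probe_trace(stable)
revised_summary = summarize_probe_trace(revised)
```

In this toy setup, `stable` never changes its preferred option, so it would be flagged as a case where the reasoning did not alter the answer, while `revised` switches once. How such traces are turned into a score for ranking candidate answers is not specified in the summary above and is left out of the sketch.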

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Supervised Chain of Thought

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

The Impact of Reasoning Step Length on Large Language Models

CoQ: An Empirical Framework for Multi-hop Question Answering Empowered by Large Language Models

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Chain of Thoughtlessness? An Analysis of CoT in Planning

Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning

A Theoretical Understanding of Chain-of-Thought: Coherent Reasoning and Error-Aware Demonstration

Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for Knowledge-intensive Question Answering

Towards Revealing the Mystery behind Chain of Thought: a Theoretical Perspective

Probabilistic Tree-of-thought Reasoning for Answering Knowledge-intensive Complex Questions

How Likely Do LLMs with CoT Mimic Human Reasoning?

Can We Verify Step by Step for Incorrect Answer Detection?

Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning

Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning

Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs