Abstract: Self-consistency (Wang et al., 2023) suggests that the most consistent answer obtained through large language models (LLMs) is more likely to be correct. In this paper, we challenge this argument and propose a nuanced correction. Our observations indicate that consistent answers derived through more computation, i.e., longer reasoning texts, rather than simply the most consistent answer across all outputs, are more likely to be correct. This is predominantly because we demonstrate that LLMs can autonomously produce chain-of-thought (CoT) style reasoning with no custom prompts merely while generating longer responses, which leads to consistent predictions that are more accurate. In the zero-shot setting, by sampling the Mixtral-8x7B model multiple times and considering longer responses, we achieve 86% of its self-consistency performance obtained through zero-shot CoT prompting on the GSM8K and MultiArith datasets. Finally, we demonstrate that the probability of LLMs generating a longer response is quite low, highlighting the need for decoding strategies conditioned on output length.

What problem does this paper attempt to address?

This paper addresses the accuracy of consistent predictions in large language models (LLMs). Specifically, it challenges the view that "the most consistent answer is more likely to be correct" and proposes a more nuanced one. The authors observe that consistent answers obtained through more computation, that is, from longer reasoning texts, are more likely to be correct than the most consistent answer across all outputs. This is because LLMs can autonomously produce chain-of-thought (CoT) style reasoning when generating longer responses, which leads to more accurate consistent predictions.

### Main research questions:

1. **Accuracy of consistent predictions**: The paper explores whether the most consistent prediction from an LLM is necessarily the one most likely to be correct.
2. **Role of long reasoning texts**: It studies whether generating longer reasoning texts can improve the accuracy of predictions.
3. **Impact of chain-of-thought (CoT)**: It analyzes the role of chain-of-thought-style reasoning in improving model performance.

### Research methods:

- **Experimental setup**: Two open-source pre-trained models, Mixtral-8x7B and Llama-2 70B, were evaluated on multiple datasets (such as GSM8K, MultiArith, AQUA-RAT and SST2).
- **Sampling strategy**: Reasoning texts were generated by sampling the model multiple times, and answers were selected according to different length thresholds.
- **Performance evaluation**: Model performance under the different strategies was evaluated with the self-consistency method.

### Main findings:

- **Longer reasoning texts are more accurate**: When the model generates longer reasoning texts, its consistent predictions are more likely to be correct.
- **Emergence of chain-of-thought**: Chain-of-thought-style reasoning emerged spontaneously in longer reasoning texts, which significantly improved the model's performance.
- **Importance of decoding strategies**: Since the model tends to generate shorter texts, a decoding strategy that takes output length into account is required to obtain longer reasoning texts.

### Conclusion:

The paper shows experimentally that generating longer reasoning texts can significantly improve the performance of zero-shot prompting on reasoning tasks. This improvement is mainly attributed to the chain-of-thought-style reasoning that emerges spontaneously in longer reasoning texts. However, the model has a low tendency to generate longer texts, so a specific decoding strategy is required to produce longer outputs.
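
The sampling strategy summarized above (sample the model several times, keep only responses above a length threshold, then take a majority vote) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `length_filtered_vote` and `long_response_rate`, the use of whitespace-separated word count as the length measure, and the toy `samples` data are all assumptions made for illustration. Answer extraction from each response is assumed to have happened already.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Each sample is (reasoning_text, extracted_answer). Hypothetical toy data,
# not taken from the paper or from any dataset.
Sample = Tuple[str, str]


def length_filtered_vote(samples: Sequence[Sample], min_length: int = 0) -> Optional[str]:
    """Majority vote over answers whose reasoning text is long enough.

    Length is measured in whitespace-separated words (an assumption).
    With min_length=0 this reduces to a plain self-consistency vote.
    Returns None when no sample passes the threshold.
    """
    kept = [answer for text, answer in samples if len(text.split()) >= min_length]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]


def long_response_rate(samples: Sequence[Sample], min_length: int) -> float:
    """Fraction of samples whose reasoning text reaches min_length words."""
    if not samples:
        return 0.0
    long_count = sum(1 for text, _ in samples if len(text.split()) >= min_length)
    return long_count / len(samples)


samples = [
    ("The answer is 12.", "12"),
    ("12", "12"),
    ("It is 12.", "12"),
    ("There are 3 boxes with 5 apples each, so 3 * 5 = 15 apples. "
     "Two are eaten, leaving 15 - 2 = 13.", "13"),
    ("First compute 3 * 5 = 15. Then subtract the 2 eaten apples to get 13.", "13"),
]

plain_vote = length_filtered_vote(samples)                 # vote over all responses
long_vote = length_filtered_vote(samples, min_length=10)   # vote over long responses only
rate = long_response_rate(samples, min_length=10)          # how often long responses occur
```

In this toy data the three short responses outvote the two longer ones in the plain vote, while restricting the vote to responses of at least 10 words selects the answer reached through step-by-step reasoning. The low value of `rate` mirrors the summary's point that long responses are sampled infrequently, which is why a length-aware decoding strategy is suggested.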

When is the consistent prediction likely to be a correct prediction?

Self-Consistency of Large Language Models under Ambiguity

Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Universal Self-Consistency for Large Language Model Generation

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Large Language Models have Intrinsic Self-Correction Ability

Large Language Models Cannot Self-Correct Reasoning Yet

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models

Semantic Consistency for Assuring Reliability of Large Language Models

Lachesis: Predicting LLM Inference Accuracy using Structural Properties of Reasoning Paths

Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Self-Consistency Improves Chain of Thought Reasoning in Language Models

MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

Calibrating Reasoning in Language Models with Internal Consistency

Evaluating Consistency and Reasoning Capabilities of Large Language Models

Large Language Models Can Self-Correct with Key Condition Verification

On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions