When is the consistent prediction likely to be a correct prediction?

Alex Nguyen, Dheeraj Mekala, Chengyu Dong, Jingbo Shang
2024-07-08
Abstract: Self-consistency (Wang et al., 2023) suggests that the most consistent answer obtained through large language models (LLMs) is more likely to be correct. In this paper, we challenge this argument and propose a nuanced correction. Our observations indicate that consistent answers derived through more computation, i.e., longer reasoning texts, rather than simply the most consistent answer across all outputs, are more likely to be correct. This is predominantly because, as we demonstrate, LLMs can autonomously produce chain-of-thought (CoT) style reasoning with no custom prompts merely while generating longer responses, which leads to consistent predictions that are more accurate. In the zero-shot setting, by sampling the Mixtral-8x7B model multiple times and considering longer responses, we achieve 86% of its self-consistency performance obtained through zero-shot CoT prompting on the GSM8K and MultiArith datasets. Finally, we demonstrate that the probability of LLMs generating a longer response is quite low, highlighting the need for decoding strategies conditioned on output length.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the accuracy of consistent predictions in large language models (LLMs). Specifically, it challenges the view that "the most consistent answer is more likely to be correct" and proposes a more nuanced one. The authors observe that consistent answers drawn from outputs that use more computation, that is, longer reasoning texts, are more likely to be correct than the most consistent answer across all outputs. This is because LLMs can autonomously produce chain-of-thought (CoT) style reasoning when generating longer responses, which leads to more accurate consistent predictions.

### Main research questions:

1. **Accuracy of consistent predictions**: The paper explores whether the most consistent prediction of a large language model is necessarily the most accurate one.
2. **Role of long reasoning texts**: It studies whether generating longer reasoning texts can improve the accuracy of predictions.
3. **Impact of chain-of-thought (CoT)**: It analyzes the role of chain-of-thought-style reasoning in improving model performance.

### Research methods:

- **Experimental setup**: Two open-source pre-trained models, Mixtral-8x7B and Llama-2 70B, were evaluated on multiple datasets (such as GSM8K, MultiArith, AQUA-RAT and SST2).
- **Sampling strategy**: Reasoning texts were generated by sampling the model multiple times, and answers were selected according to different length thresholds.
- **Performance evaluation**: The model performance under the different strategies was evaluated with the self-consistency method.

### Main findings:

- **Longer reasoning texts are more accurate**: Consistent predictions drawn from longer reasoning texts are more likely to be correct.
- **Emergence of chain-of-thought**: Chain-of-thought-style reasoning emerged spontaneously in longer reasoning texts, which significantly improved the model's performance.
- **Importance of decoding strategies**: Since the model tends to generate shorter texts, a decoding strategy that takes the output length into account is required to obtain longer reasoning texts.

### Conclusion:

The paper shows experimentally that generating longer reasoning texts can significantly improve the performance of zero-shot prompting on reasoning tasks. This improvement is mainly attributed to the chain-of-thought-style reasoning that emerges spontaneously in longer reasoning texts. However, the model rarely generates longer texts on its own, so a specific decoding strategy is required to produce longer outputs.
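
The sampling strategy summarized above (a majority vote restricted to responses above a length threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of whitespace-separated word count as the length measure, the threshold value, and the toy samples are all assumptions made for illustration, and the samples are invented rather than taken from the paper's experiments.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def self_consistency(answers: Sequence[str]) -> Optional[str]:
    """Plain self-consistency: majority vote over all sampled answers."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def length_filtered_consistency(
    samples: Sequence[Tuple[str, str]], min_length: int
) -> Optional[str]:
    """Majority vote restricted to sufficiently long responses.

    Each sample is a (response_text, extracted_answer) pair.
    Length is approximated by the whitespace-separated word count
    (an assumption; the paper may measure length differently).
    """
    kept = [
        answer
        for text, answer in samples
        if len(text.split()) >= min_length
    ]
    return self_consistency(kept)


# Invented toy samples: short responses guess, long ones reason step by step.
samples = [
    ("12", "12"),
    ("The answer is 12.", "12"),
    ("12 dollars", "12"),
    ("She sells 16 - 3 - 4 = 9 eggs a day, and 9 * 2 = 18, "
     "so the answer is 18.", "18"),
    ("First, 16 - 3 - 4 = 9 eggs remain. Each sells for 2 dollars, "
     "so 9 * 2 = 18.", "18"),
]

plain_vote = self_consistency([answer for _, answer in samples])
long_vote = length_filtered_consistency(samples, min_length=10)
print(plain_vote, long_vote)
```

In this toy case the plain vote is dominated by the short responses, while the length-filtered vote only counts the two responses that contain step-by-step reasoning. If no response reaches the threshold, the sketch returns `None`, which reflects the practical issue the summary notes: long responses are rarely sampled.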