Dual Traits in Probabilistic Reasoning of Large Language Models

Shenxiong Li, Huaxia Rui
2024-12-15
Abstract: We conducted three experiments to investigate how large language models (LLMs) evaluate posterior probabilities. Our results reveal the coexistence of two modes in posterior judgment among state-of-the-art models: a normative mode, which adheres to Bayes' rule, and a representative-based mode, which relies on similarity, paralleling human System 1 and System 2 thinking. Additionally, we observed that LLMs struggle to recall base rate information from their memory, and developing prompt engineering strategies to mitigate representative-based judgment may be challenging. We further conjecture that the dual modes of judgment may be a result of the contrastive loss function employed in reinforcement learning from human feedback. Our findings underscore the potential direction for reducing cognitive biases in LLMs and the necessity for cautious deployment of LLMs in critical areas.
Artificial Intelligence, Computation and Language, Computers and Society
What problem does this paper attempt to address?
This paper examines how large language models (LLMs) behave, and where they are biased, when evaluating posterior probabilities. Through three experiments, the researchers explored the probabilistic reasoning patterns of LLMs under different conditions and found two coexisting patterns: a normative mode, which follows Bayes' rule, and a representative-based mode, which relies on similarity judgments. The study also found that LLMs have difficulty recalling base-rate information from memory and, in some cases, exhibit cognitive biases similar to those of humans.

### Research Background

As LLMs are widely applied in fields such as academia, law, medicine, and finance, it has become important that they accurately evaluate the posterior probability \(P(H|E)\) of a hypothesis \(H\) given evidence \(E\). This study explores how LLMs judge this posterior probability and what cognitive patterns underlie those judgments.

### Main Questions

1. **Dual-mode Reasoning**: Do LLMs use two different reasoning patterns when evaluating posterior probabilities? Are these patterns similar to human System 1 (intuitive thinking) and System 2 (analytical thinking)?
2. **Recall of Base Rates**: Can LLMs correctly recall and apply base-rate information?
3. **Influence of Prompt Engineering**: Can prompt engineering reduce the representative-based judgment bias of LLMs?

### Experimental Design

To answer these questions, the researchers designed three experiments:

- **Structured Test**: Provide all necessary information to evaluate whether LLMs can follow Bayes' rule under ideal conditions.
- **Semi-structured Test**: Withhold some information (such as the diagnosticity of the evidence) to observe how LLMs perform with incomplete information.
- **Unstructured Test**: Rely entirely on the memory and reasoning ability of LLMs to evaluate their performance without explicit guidance.

### Main Findings

1. **Coexistence of Dual Modes**: LLMs use two modes when evaluating posterior probabilities: a normative mode that follows Bayes' rule, and a representative-based mode that relies on similarity.
2. **Low Sensitivity to Base Rates**: LLMs are not sensitive to changes in the base rate, indicating that their judgments rely more on representativeness than on normative reasoning.
3. **Limited Effect of Prompt Engineering**: Adjusting the prompts can make LLMs more inclined to reason with Bayes' rule, but the effect is limited, especially in the unstructured test.

### Practical Significance

The results emphasize the need for caution when deploying LLMs in critical areas such as medical diagnosis and legal judgment. Users should be aware that the probabilistic reasoning of LLMs may be biased, especially in tasks involving base rates and complex reasoning. Future research should focus on developing more effective training methods and prompt strategies to reduce these biases and improve the reliability of LLMs.

### Conclusion

This study shows that LLMs exhibit a dual cognitive pattern in probabilistic reasoning similar to that of humans, but have significant deficiencies in processing base-rate information. Understanding these patterns can help improve the design and application of LLMs and support their reliability and accuracy in practice.
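To make the contrast between the two modes concrete, the following is a minimal sketch, not the authors' method. The function names and the numbers are hypothetical, and the similarity-only function is an illustrative stand-in for representative-based judgment (it is what Bayes' rule gives when the base rate is fixed at 0.5), not a definition taken from the paper.

```python
def bayes_posterior(prior, p_e_given_h, p_e_given_not_h):
    """Normative mode: P(H|E) for a binary hypothesis via Bayes' rule."""
    # Total probability of the evidence across H and not-H.
    p_e = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / p_e


def similarity_only_judgment(p_e_given_h, p_e_given_not_h):
    """Illustrative stand-in for representative-based judgment.

    Assumption: the judgment depends only on how well the evidence
    matches each hypothesis, so the base rate never enters.
    """
    return p_e_given_h / (p_e_given_h + p_e_given_not_h)


if __name__ == "__main__":
    # Hypothetical numbers: a rare hypothesis with highly diagnostic evidence.
    prior = 0.01
    p_e_given_h = 0.9
    p_e_given_not_h = 0.1

    print("normative:", bayes_posterior(prior, p_e_given_h, p_e_given_not_h))
    print("similarity only:", similarity_only_judgment(p_e_given_h, p_e_given_not_h))
```

With these hypothetical inputs the normative posterior is 1/12 (about 0.083), while the similarity-only judgment is 0.9 regardless of the base rate. A judgment that barely moves when the prior changes is the kind of low base-rate sensitivity described in the findings.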