Abstract: We conducted three experiments to investigate how large language models (LLMs) evaluate posterior probabilities. Our results reveal the coexistence of two modes in posterior judgment among state-of-the-art models: a normative mode, which adheres to Bayes' rule, and a representative-based mode, which relies on similarity -- paralleling human System 1 and System 2 thinking. Additionally, we observed that LLMs struggle to recall base rate information from their memory, and developing prompt engineering strategies to mitigate representative-based judgment may be challenging. We further conjecture that the dual modes of judgment may be a result of the contrastive loss function employed in reinforcement learning from human feedback. Our findings underscore the potential direction for reducing cognitive biases in LLMs and the necessity for cautious deployment of LLMs in critical areas.

What problem does this paper attempt to address?

This paper examines how large language models (LLMs) behave, and where they are biased, when evaluating posterior probabilities. Through three experiments, the researchers explored the probabilistic reasoning of LLMs under different conditions and found two coexisting modes: a normative mode, which follows Bayes' rule, and a representative-based mode, which relies on similarity judgments. The study also found that LLMs have difficulty recalling base rate information from memory and, in some cases, exhibit cognitive biases similar to those of humans.

### Research Background

LLMs are now widely applied in fields such as academia, law, medicine, and finance, where it is important to accurately evaluate the posterior probability \(P(H|E)\) of a hypothesis \(H\) given evidence \(E\). This study explores how LLMs judge this posterior probability and what cognitive patterns underlie those judgments.

### Main Questions

1. **Dual-mode Reasoning**: Do LLMs use two different reasoning modes when evaluating posterior probabilities? Do these modes resemble human System 1 (intuitive thinking) and System 2 (analytical thinking)?
2. **Recall of Base Rates**: Can LLMs correctly recall and apply base rate information?
3. **Influence of Prompt Engineering**: Can prompt engineering reduce the representative-based judgment bias of LLMs?

### Experimental Design

To answer these questions, the researchers designed three experiments:

- **Structured Test**: Provide all necessary information to evaluate whether LLMs can follow Bayes' rule under ideal conditions.
- **Semi-structured Test**: Withhold some information (such as the diagnosticity of the evidence) to observe how LLMs perform with incomplete information.
- **Unstructured Test**: Rely entirely on the memory and reasoning ability of LLMs to evaluate their performance without explicit guidance.

### Main Findings

1. **Coexistence of Dual Modes**: LLMs do use two modes when evaluating posterior probabilities: a normative mode that follows Bayes' rule and a representative-based mode that relies on similarity.
2. **Low Sensitivity to Base Rates**: LLMs are not sensitive to changes in the base rate, indicating that their reasoning relies more on representativeness than on the normative rule.
3. **Limits of Prompt Engineering**: Adjusting the prompt can steer LLMs toward using Bayes' rule, but the effect is limited, especially in the unstructured test.

### Practical Significance

The results emphasize the need for careful deployment of LLMs in critical areas such as medical diagnosis and legal judgment. Users should be aware that the probabilistic reasoning of LLMs may be biased, especially in tasks that involve base rates and complex reasoning. Future research should focus on developing more effective training methods and prompting strategies to reduce these biases and improve the reliability of LLMs.

### Conclusion

This study shows that LLMs exhibit a dual cognitive pattern in probabilistic reasoning similar to that of humans, but with significant deficiencies in processing base rate information. Understanding these patterns helps improve the design and application of LLMs and supports their reliability and accuracy in practice.
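To make the contrast between the two modes concrete, the following minimal Python sketch compares a normative posterior \(P(H|E)\) computed with Bayes' rule against a similarity-only judgment that ignores the base rate. It is an illustration under stated assumptions, not the paper's experimental code: the function names `bayes_posterior` and `representativeness_judgment` are hypothetical, the similarity-only rule (equivalent to assuming a 50/50 prior) is just one simple way to model base rate neglect, and all numbers are made up for the example.

```python
def bayes_posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Normative mode: P(H|E) from Bayes' rule.

    prior            -- base rate P(H)
    p_e_given_h      -- likelihood P(E|H)
    p_e_given_not_h  -- likelihood P(E|not H)
    """
    # Total probability of the evidence, P(E)
    evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / evidence


def representativeness_judgment(p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Assumed model of the representative-based mode.

    Uses only how well the evidence matches each hypothesis and
    ignores the base rate (same as Bayes' rule with a 0.5 prior).
    """
    return p_e_given_h / (p_e_given_h + p_e_given_not_h)


if __name__ == "__main__":
    # Illustrative likelihoods, not taken from the paper
    p_e_h, p_e_not_h = 0.9, 0.1
    for prior in (0.01, 0.10, 0.50):
        normative = bayes_posterior(prior, p_e_h, p_e_not_h)
        similarity = representativeness_judgment(p_e_h, p_e_not_h)
        print(f"base rate={prior:.2f}  Bayes={normative:.3f}  similarity-only={similarity:.3f}")
```

In this sketch the similarity-only judgment stays fixed as the base rate changes, while the Bayes posterior drops sharply for rare hypotheses. That gap is the kind of base rate insensitivity the summary describes.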

Dual Traits in Probabilistic Reasoning of Large Language Models
