Abstract: Empowering large language models (LLMs) to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or on model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This leads to a growing need to explore the untapped area of black-box approaches for LLM uncertainty estimation. To better break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks, confidence calibration and failure prediction, across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely used LLMs, including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: (1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. (2) As model capability scales up, both calibration and failure prediction performance improve. (3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies, can help mitigate this overconfidence from various perspectives. (4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 to 0.605 in AUROC. Despite these advancements, none of these techniques consistently outperforms the others, and all investigated methods struggle in challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs.
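
As a rough illustration of the quantities named in the abstract, the sketch below shows one simple way to turn several sampled answers into a consistency-based confidence, and how calibration (expected calibration error, ECE) and failure prediction (AUROC) could then be scored. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, majority-vote agreement stands in for the aggregation techniques the abstract mentions, and the ECE uses equal-width bins.

```python
from collections import Counter


def consistency_confidence(answers):
    """Majority answer and the fraction of samples that agree with it.

    Assumption: agreement rate is used as the confidence score; other
    aggregation schemes are possible.
    """
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1] (binning scheme is an assumption)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as one half (pairwise definition of AUROC)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical sampled answers to a single question.
    print(consistency_confidence(["A", "A", "B", "A"]))
    # Hypothetical per-question confidences and correctness labels.
    confs = [0.9, 0.8, 0.3, 0.6]
    labels = [1, 1, 0, 0]
    print(expected_calibration_error(confs, labels), auroc(confs, labels))
```

An overconfident model shows up here as a large gap between average confidence and accuracy within the high-confidence bins, while an AUROC near 0.5 means the confidence scores barely separate correct answers from incorrect ones.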

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?

Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?

Distinguishing the Knowable from the Unknowable with Language Models

"I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust

Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge

Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty

Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning

Confidence in the Reasoning of Large Language Models

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Do Large Language Models Know What They Don't Know?

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models

Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning

To Believe or Not to Believe Your LLM

Can LLMs Learn Uncertainty on Their Own? Expressing Uncertainty Effectively in A Self-Training Manner

Finetuning Language Models to Emit Linguistic Expressions of Uncertainty

Large Language Model Confidence Estimation via Black-Box Access

Epistemic Integrity in Large Language Models

Enhancing Trust in Large Language Models with Uncertainty-Aware Fine-Tuning

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions

Perceptions of Linguistic Uncertainty by Language Models and Humans