Abstract: Empowering large language models to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This leads to a growing need to explore the untapped area of black-box approaches for LLM uncertainty estimation. To better break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks, confidence calibration and failure prediction, across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely used LLMs including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: 1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. 2) As model capability scales up, both calibration and failure prediction performance improve. 3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies, can help mitigate this overconfidence from various perspectives. 4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 to 0.605 in AUROC. Despite these advancements, none of these techniques consistently outperforms the others, and all investigated methods struggle in challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs.
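The two evaluation tasks named in the abstract can be made concrete with a small example. A common way to score calibration is expected calibration error (ECE), and failure prediction is typically scored with AUROC, the metric the abstract quotes. The following is a minimal sketch in plain Python; the function names, the 10-bin default, and the toy inputs are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Bin-size-weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(accuracy - avg_conf)
    return ece


def auroc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a correct answer gets higher confidence than an incorrect one.

    Ties count as half a win. 0.5 means the confidence carries no signal
    about correctness; 1.0 means it separates right from wrong perfectly.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy illustration of overconfidence: the model always says 90% but is right half the time.
overconfident = [0.9, 0.9, 0.9, 0.9]
outcomes = [1, 1, 0, 0]
print(expected_calibration_error(overconfident, outcomes))  # large gap between confidence and accuracy
print(auroc(overconfident, outcomes))  # confidence does not distinguish right from wrong
```

The two metrics answer different questions: ECE asks whether a stated confidence of 90% corresponds to roughly 90% accuracy, while AUROC asks only whether correct answers are ranked above incorrect ones, regardless of the absolute values.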

What problem does this paper attempt to address?

The paper primarily explores the issues that large language models (LLMs) face in expressing their confidence in their answers and proposes a systematic framework to evaluate and improve these models' confidence estimation capabilities. Specifically, the paper aims to address the following key issues:

1. **The Importance of Accurate Confidence Expression**: The study emphasizes that for reliable decision-making, it is crucial for large language models to accurately express their confidence in their answers.
2. **Limitations of Existing Methods**: Current confidence estimation methods mostly rely on white-box access to the model's internal information or model fine-tuning. However, with the proliferation of closed-source large language models (such as GPT-3.5, GPT-4, etc.), these methods have become less applicable.
3. **Need for Black-Box Methods**: Therefore, the paper focuses on exploring new black-box methods to assess the uncertainty of large language models, especially on how to effectively estimate the model's confidence in its answers when direct access to the model's internal structure is not possible.
4. **Systematic Framework**: To better address the problem, the authors define a systematic framework consisting of three components: prompt strategy (to elicit the model's expression of its confidence), sampling strategy (methods for generating multiple responses), and aggregation strategy (to compute consistency).
5. **Experiments and Analysis**: Through experiments on different task types (such as commonsense reasoning, arithmetic reasoning, etc.) and various widely used large language models (including GPT-4 and LLaMA 2 Chat), the paper reveals several important findings:
   - Large language models tend to be overconfident when expressing their confidence.
   - As the model's capabilities improve, calibration and failure prediction performance also improve but still fall short of ideal levels.
   - The proposed human-inspired prompt strategies, response consistency, and better aggregation strategies help mitigate the overconfidence issue.
   - Although black-box methods perform slightly worse compared to white-box methods, the gap is not significant.

In summary, the main goal of the paper is to develop a systematic framework to help large language models more accurately express their confidence in their answers, thereby improving the reliability and trustworthiness of the decision-making process. Additionally, the study points out the limitations of existing methods and directions for future research.
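The three-component framework described above (prompting, sampling, aggregation) can be sketched as a short black-box pipeline. This is an illustrative sketch under stated assumptions, not the authors' implementation: `ask_model` is a hypothetical callable standing in for any LLM API that maps a prompt string to a response string, and the prompt wording, the response format, and both aggregation functions are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

# Assumed prompt template that asks for a verbalized confidence (not the paper's wording).
PROMPT = (
    "Answer the question and state how confident you are (0-100%).\n"
    "Format: 'Answer: <answer>, Confidence: <number>%'\n"
    "Question: {question}"
)

Sample = Tuple[str, float]  # (answer, verbalized confidence in [0, 1])


def parse_response(text: str) -> Sample:
    """Extract the answer and the verbalized confidence from one response."""
    m = re.search(r"Answer:\s*(.+?),\s*Confidence:\s*(\d+(?:\.\d+)?)\s*%", text)
    if m is None:
        raise ValueError(f"response not in the expected format: {text!r}")
    return m.group(1).strip(), min(float(m.group(2)), 100.0) / 100.0


def sample_responses(
    ask_model: Callable[[str], str], question: str, k: int = 5
) -> List[Sample]:
    """Sampling step: query the (hypothetical) black-box model k times."""
    prompt = PROMPT.format(question=question)
    return [parse_response(ask_model(prompt)) for _ in range(k)]


def consistency_confidence(samples: List[Sample]) -> Tuple[str, float]:
    """Aggregation by agreement: confidence is the share of samples voting for the top answer."""
    counts = Counter(answer for answer, _ in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def weighted_confidence(samples: List[Sample]) -> Tuple[str, float]:
    """Aggregation that weights each vote by its verbalized confidence."""
    weight = defaultdict(float)
    for answer, conf in samples:
        weight[answer] += conf
    total = sum(weight.values())
    best = max(weight, key=weight.get)
    return best, (weight[best] / total if total > 0 else 0.0)


# Usage with a stand-in model that replays canned responses.
canned = iter([
    "Answer: 4, Confidence: 90%",
    "Answer: 4, Confidence: 80%",
    "Answer: 5, Confidence: 30%",
])
samples = sample_responses(lambda prompt: next(canned), "What is 2 + 2?", k=3)
print(consistency_confidence(samples))
print(weighted_confidence(samples))
```

Because the pipeline only needs text in and text out, it applies equally to closed-source APIs, which is the setting the summary highlights. The agreement-based score ignores what the model says about its own confidence, while the weighted variant combines both signals.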

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

On Verbalized Confidence Scores for LLMs

Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge

Large Language Model Confidence Estimation via Black-Box Access

Can LLMs Learn Uncertainty on Their Own? Expressing Uncertainty Effectively in A Self-Training Manner

"I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust

Enhancing Confidence Expression in Large Language Models Through Learning from Past Experience

Cycles of Thought: Measuring LLM Confidence through Stable Explanations

A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty

Benchmarking LLMs via Uncertainty Quantification

Confidence in the Reasoning of Large Language Models

Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models

The Calibration Gap between Model and Human Confidence in Large Language Models

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models

Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment

UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

Can We Trust LLMs? Mitigate Overconfidence Bias in LLMs through Knowledge Transfer