Abstract: Empowering large language models to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This leads to a growing need to explore the untapped area of black-box approaches for LLM uncertainty estimation. To better break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks, confidence calibration and failure prediction, across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely used LLMs including GPT-4 and LLaMA 2 Chat. Our analysis uncovers several key insights: 1) LLMs, when verbalizing their confidence, tend to be overconfident, potentially imitating human patterns of expressing confidence. 2) As model capability scales up, both calibration and failure prediction performance improve. 3) Employing our proposed strategies, such as human-inspired prompts, consistency among multiple responses, and better aggregation strategies, can help mitigate this overconfidence from various perspectives. 4) Comparisons with white-box methods indicate that while white-box methods perform better, the gap is narrow, e.g., 0.522 to 0.605 in AUROC. Despite these advancements, none of these techniques consistently outperforms the others, and all investigated methods struggle in challenging tasks, such as those requiring professional knowledge, indicating significant scope for improvement. We believe this study can serve as a strong baseline and provide insights for eliciting confidence in black-box LLMs.
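
As a rough illustration of the aggregation and evaluation steps named in the abstract, the sketch below computes a consistency-based confidence from several sampled answers, then scores confidences with expected calibration error (ECE, for calibration) and AUROC (for failure prediction). It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the majority-vote aggregation, the ten equal-width bins, and the toy inputs are all hypothetical choices made here for illustration.

```python
from collections import Counter


def consistency_confidence(answers):
    """Return the majority answer and the fraction of samples that agree with it.

    Assumption: agreement rate among sampled responses is used as confidence.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence|.

    Assumption: equal-width bins [k/n_bins, (k+1)/n_bins), with 1.0 in the last bin.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece


def auroc(confidences, correct):
    """Probability that a correct answer receives higher confidence than an
    incorrect one, counting ties as one half."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy usage with made-up data (not results from the paper).
answer, conf = consistency_confidence(["A", "A", "B", "A"])
toy_conf = [0.9, 0.8, 0.7, 0.6]
toy_correct = [1, 0, 1, 0]
print(answer, conf)
print(expected_calibration_error(toy_conf, toy_correct))
print(auroc(toy_conf, toy_correct))
```

An AUROC near 0.5 means confidence barely separates correct from incorrect answers, which is the scale on which the abstract's 0.522 to 0.605 comparison should be read.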

Probabilities of Chat LLMs Are Miscalibrated but Still Predict Correctness on Multiple-Choice Q&A

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models

Calibrated Large Language Models for Binary Question Answering

LLMs May Perform MCQA by Selecting the Least Incorrect Option

Leveraging Large Language Models for Multiple Choice Question Answering

On Overcoming Miscalibrated Conversational Priors in LLM-based Chatbots

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Reconfidencing LLMs from the Grouping Loss Perspective

Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only

Can multiple-choice questions really be useful in detecting the abilities of LLMs?

When to Trust LLMs: Aligning Confidence with Response Quality

Are Language Model Logits Calibrated?

The Calibration Gap between Model and Human Confidence in Large Language Models

Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions

Language Models can Evaluate Themselves via Probability Discrepancy

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

Calibrating Verbalized Probabilities for Large Language Models

Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence?

Large Language Models Must Be Taught to Know What They Don't Know