Calibrating Large Language Models with Sample Consistency

Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch

2024-02-22

Abstract: Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and massive scale. In this work, we explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. We perform an extensive evaluation across various open and closed-source models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches. Meanwhile, we find that factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency have the potential to enhance model performance. Finally, we offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.

Computation and Language

What problem does this paper attempt to address?

This paper addresses the difficulty of accurately estimating how confident a large language model (LLM) is in its predictions. Specifically, it studies how to calibrate LLM confidence using the consistency of multiple generations sampled from the model, with the goal of improving reliability and performance. The paper points out that existing LLMs are usually uncalibrated, and that their proprietary nature and massive scale make traditional calibration techniques inapplicable or too costly. The authors therefore derive confidence from sample consistency, using three consistency measures (agreement, entropy, and first-second distance), and evaluate them extensively on open and closed-source models across nine reasoning datasets. The study finds that these consistency-based calibration methods outperform existing post-hoc calibration baselines. The paper also reports that intermediate explanations, model scaling, and larger sample sizes improve calibration, while instruction tuning makes calibration more difficult. Finally, the authors provide practical guidance on choosing a consistency measure suited to the characteristics of different LMs.
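The three consistency measures named above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, and the exact definitions used here (in particular, normalizing entropy by the log of the number of distinct answers, and dividing the first-second gap by the sample count) are assumptions made for illustration. The input is a list of final answers extracted from several sampled generations for the same question.

```python
from collections import Counter
import math


def agreement(answers):
    """Fraction of samples that match the most frequent answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def entropy_consistency(answers):
    """One minus the normalized entropy of the answer distribution.

    Assumption: entropy is normalized by log(number of distinct answers),
    so the score is 1.0 when all samples agree and 0.0 when the distinct
    answers are evenly split.
    """
    counts = Counter(answers)
    n = len(answers)
    k = len(counts)
    if k <= 1:
        return 1.0  # all samples agree; entropy is zero
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - h / math.log(k)


def first_second_distance(answers):
    """Gap between the shares of the top and second most frequent answers."""
    top_two = Counter(answers).most_common(2)
    first = top_two[0][1]
    second = top_two[1][1] if len(top_two) > 1 else 0
    return (first - second) / len(answers)


# Hypothetical example: five sampled answers to one question.
samples = ["42", "42", "42", "17", "8"]
scores = {
    "agreement": agreement(samples),
    "entropy": entropy_consistency(samples),
    "fsd": first_second_distance(samples),
}
```

In this sketch all three scores lie in [0, 1], with higher values meaning more consistent samples. Agreement looks only at the majority answer, first-second distance also penalizes a strong runner-up, and the entropy-based score reflects the whole answer distribution.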

Calibrating Long-form Generations from Large Language Models

Graph-based Confidence Calibration for Large Language Models

On the Calibration of Large Language Models and Alignment

Calibrating the Confidence of Large Language Models by Eliciting Fidelity

Calibrating Large Language Models Using Their Generations Only

A Survey of Confidence Estimation and Calibration in Large Language Models

The Calibration Gap between Model and Human Confidence in Large Language Models

Multicalibration for Confidence Scoring in LLMs

Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment

Large Language Models Must Be Taught to Know What They Don't Know

Atomic Calibration of LLMs in Long-Form Generations

Calibrating Verbalized Probabilities for Large Language Models

Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation

Few-Shot Recalibration of Language Models

Calibrating LLM-Based Evaluator

Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration

Large Language Model Confidence Estimation via Black-Box Access

Does Alignment Tuning Really Break LLMs' Internal Confidence?

Calibrated Large Language Models for Binary Question Answering