Calibrating Large Language Models with Sample Consistency

Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, Chris Callison-Burch
2024-02-22
Abstract: Accurately gauging the confidence level of Large Language Models' (LLMs) predictions is pivotal for their reliable application. However, LLMs are often uncalibrated inherently and elude conventional calibration techniques due to their proprietary nature and massive scale. In this work, we explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency. We perform an extensive evaluation across various open and closed-source models on nine reasoning datasets. Results show that consistency-based calibration methods outperform existing post-hoc approaches. Meanwhile, we find that factors such as intermediate explanations, model scaling, and larger sample sizes enhance calibration, while instruction-tuning makes calibration more difficult. Moreover, confidence scores obtained from consistency have the potential to enhance model performance. Finally, we offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the difficulty of accurately estimating how confident large language models (LLMs) are in their predictions. Specifically, it studies how to calibrate LLM confidence using the consistency among multiple randomly sampled generations, with the aim of making the models more reliable and potentially improving their performance. The paper points out that LLMs are often uncalibrated and that, because of their proprietary nature and massive scale, conventional calibration techniques are frequently inapplicable or too costly. The authors therefore derive confidence from sample consistency and evaluate three consistency measures (agreement, entropy, and first-second distance) across open and closed-source models on nine reasoning datasets. The results show that these consistency-based calibration methods outperform existing post-hoc calibration baselines. The paper also finds that intermediate explanations, model scaling, and larger sample sizes improve calibration, whereas instruction tuning makes calibration more difficult. Finally, the authors provide practical guidance on choosing a consistency measure suited to the characteristics of different LMs.
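To make the three consistency measures more concrete, the following is a minimal Python sketch of how they might be computed from a list of sampled final answers. The formulas are assumptions chosen for illustration, since this page does not define them: agreement is taken as the share of samples matching the most frequent answer, the entropy-based score as one minus the answer-distribution entropy normalized by the log of the number of distinct answers, and first-second distance as the gap between the shares of the two most frequent answers. The function name `consistency_scores` and its return format are hypothetical.

```python
import math
from collections import Counter


def consistency_scores(answers):
    """Return illustrative consistency-based confidence scores.

    `answers` is a list of final answers extracted from several sampled
    generations for the same question. All formulas are assumptions made
    for illustration, not definitions taken from the paper.
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")

    n = len(answers)
    counts = Counter(answers)
    ranked = counts.most_common()  # sorted by frequency, most frequent first

    top_answer, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0

    # Agreement: share of samples that match the most frequent answer.
    agreement = top_count / n

    # Entropy-based score: 1 - normalized entropy of the answer distribution.
    # With a single distinct answer the entropy is 0, so the score is 1.
    if len(counts) == 1:
        entropy_conf = 1.0
    else:
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        entropy_conf = 1.0 - entropy / math.log(len(counts))

    # First-second distance: gap between the top two answers' shares.
    fsd = (top_count - second_count) / n

    return {
        "prediction": top_answer,
        "agreement": agreement,
        "entropy": entropy_conf,
        "fsd": fsd,
    }


if __name__ == "__main__":
    # Five hypothetical sampled answers to one multiple-choice question.
    print(consistency_scores(["A", "A", "A", "B", "C"]))
```

In this sketch all three scores lie in [0, 1] and equal 1 when every sample agrees. They diverge when answers are split: agreement looks only at the top answer, first-second distance also penalizes a strong runner-up, and the entropy-based score reacts to the spread of the whole distribution.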