A Survey of Confidence Estimation and Calibration in Large Language Models

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, Iryna Gurevych
2024-03-25
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, they can be unreliable due to factual errors in their generations. Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations. There has been a lot of recent research aiming to address this, but there has been no comprehensive overview to organize it and outline the main lessons learned. The present survey aims to bridge this gap. In particular, we outline the challenges and we summarize recent technical advancements for LLM confidence estimation and calibration. We further discuss their applications and suggest promising directions for future work.
Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper "A Survey of Confidence Estimation and Calibration in Large Language Models" addresses the challenges of estimating and calibrating the confidence of large language models (LLMs) in generation tasks. Although LLMs perform well across many tasks, they can produce factual errors when generating text, which makes their outputs unreliable. Estimating and calibrating model confidence can help mitigate these risks and improve generation quality. Specifically, the paper focuses on the following issues:

1. **Confidence Estimation**:
   - How can the confidence of LLMs be evaluated across different tasks?
   - What confidence estimation methods exist, and what are their applicability and limitations for LLMs?
2. **Model Calibration**:
   - How can the predicted probabilities of LLMs be calibrated so that they align with actual accuracy?
   - What calibration methods exist, and how effective are they when applied to LLMs?
3. **Technical Challenges**:
   - The output space of LLMs is extremely large: the number of possible outputs grows exponentially with generation length, so it is impossible to evaluate every potential response.
   - Different expressions may convey the same meaning, so confidence estimation needs to account for semantics.
   - LLMs have unique characteristics, such as expressing confidence in text and performing zero-shot or few-shot learning, but their responses are sensitive to prompts, which can make results unstable.
4. **Applications and Future Directions**:
   - What are the practical applications of confidence estimation and calibration in LLMs?
   - What are the future research directions and likely development trends?

By systematically reviewing and summarizing current research, the paper aims to fill this gap and to provide theoretical and technical support for developing more reliable LLM applications.
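
To make the calibration goal above concrete (predicted probabilities that align with actual accuracy), the following minimal sketch computes the expected calibration error (ECE), a commonly used measure of the gap between confidence and accuracy. This summary does not state which metrics the survey adopts, so treat ECE here as an illustrative choice; the function name, its parameters, and the equal-width binning scheme are assumptions made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Illustrative ECE: weighted mean of |accuracy - confidence| per bin.

    `confidences` are model-reported probabilities in [0, 1];
    `correct` marks whether each corresponding answer was right.
    Both names are hypothetical and chosen for this sketch.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to one of n_bins equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the fraction of predictions it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A value near zero indicates that stated confidence tracks observed accuracy; for instance, a model that reports 0.9 confidence on four answers but gets only three right would show a gap of about 0.15. Note that this sketch scores one confidence value per answer and does not address the semantic-equivalence or prompt-sensitivity challenges listed above.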