Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, they can be unreliable due to factual errors in their generations. Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations. There has been a lot of recent research aiming to address this, but there has been no comprehensive overview to organize it and outline the main lessons learned. The present survey aims to bridge this gap. In particular, we outline the challenges and we summarize recent technical advancements for LLM confidence estimation and calibration. We further discuss their applications and suggest promising directions for future work.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper "A Survey of Confidence Estimation and Calibration in Large Language Models" addresses the challenges of estimating and calibrating the confidence of large language models (LLMs) in generation tasks. Although LLMs perform well across many tasks, they can produce factual errors during generation, which makes their output unreliable. Evaluating and calibrating model confidence can help mitigate these risks and improve generation quality. Specifically, the paper focuses on the following issues:

1. **Confidence Estimation**:
   - How can the confidence of LLMs be evaluated across different tasks?
   - What methods exist for confidence estimation, and what are their applicability and limitations for LLMs?
2. **Model Calibration**:
   - How can the prediction probabilities of LLMs be calibrated so that they align with actual accuracy?
   - What calibration methods exist, and how effective are they when applied to LLMs?
3. **Technical Challenges**:
   - The output space of LLMs is extremely large: the number of possible outputs grows exponentially with generation length, so it is impossible to evaluate every potential response.
   - Different expressions may convey the same meaning, so confidence estimation needs to take semantics into account.
   - LLMs have distinctive characteristics, such as expressing confidence in text and performing zero-shot or few-shot learning, but their responses are sensitive to prompts, which can make results unstable.
4. **Applications and Future Directions**:
   - What are the practical applications of confidence estimation and calibration in LLMs?
   - What are the future research directions and likely development trends?

By systematically reviewing and summarizing current research, the paper aims to fill this gap and to provide theoretical and technical support for building more reliable LLM applications.
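To make "aligning prediction probabilities with actual accuracy" concrete, the sketch below computes Expected Calibration Error (ECE), a widely used calibration metric. The summary above does not name this metric, so treat it as an illustrative assumption rather than the survey's own method. The function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are likewise choices made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    Illustrative sketch: equal-width bins over [0, 1]. Lower is better,
    and 0.0 means confidence matches accuracy in every non-empty bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group each (confidence, correctness) pair into its confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the fraction of samples it holds.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical data: the model reports 0.9 confidence on four
    # answers but only three are correct, so it is overconfident.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
```

In this toy input the model claims 90% confidence but is right 75% of the time, so the metric reports a gap of about 0.15. A calibration method would aim to shrink that gap, for example by rescaling the reported confidences.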

A Survey of Confidence Estimation and Calibration in Large Language Models

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

Calibrating Long-form Generations from Large Language Models

A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

The Calibration Gap between Model and Human Confidence in Large Language Models

Calibrating Large Language Models with Sample Consistency

Graph-based Confidence Calibration for Large Language Models

A Survey on Evaluation of Large Language Models

Large Language Model Confidence Estimation via Black-Box Access

MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models

Calibrating the Confidence of Large Language Models by Eliciting Fidelity

On the Calibration of Large Language Models and Alignment

A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice

A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models

Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models

Calibrating Large Language Models Using Their Generations Only

Atomic Calibration of LLMs in Long-Form Generations