Abstract: To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where an answer can be partially correct. Addressing this gap, we introduce a unified calibration framework, in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores. Within this framework, we develop three metrics to precisely evaluate LLM calibration and further propose two confidence elicitation methods based on self-consistency and self-evaluation. Our experiments, which include long-form QA and summarization tasks, demonstrate that larger models don't necessarily guarantee better calibration, that calibration performance is found to be metric-dependent, and that self-consistency methods excel in factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. Lastly, we showcase a practical application of our system: selecting and cascading open-source models and ChatGPT to optimize correctness given a limited API budget. This research not only challenges existing notions of LLM calibration but also offers practical methodologies for improving trustworthiness in long-form generation.
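To make the idea of graded (non-binary) calibration concrete, the following is a minimal sketch, not one of the paper's three metrics, which the abstract does not spell out. It assumes each response has a confidence score and a correctness score, both in [0, 1]; the function names `mean_calibration_gap` and `binned_calibration_error` are hypothetical. The second function is an ECE-style binned comparison adapted to graded correctness.

```python
from typing import List, Sequence, Tuple


def mean_calibration_gap(confidence: Sequence[float],
                         correctness: Sequence[float]) -> float:
    """Average |confidence - correctness| per response (0.0 = perfectly aligned)."""
    if len(confidence) != len(correctness) or not confidence:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(c - s) for c, s in zip(confidence, correctness)) / len(confidence)


def binned_calibration_error(confidence: Sequence[float],
                             correctness: Sequence[float],
                             n_bins: int = 5) -> float:
    """ECE-style error: bin responses by confidence, then compare each bin's
    mean confidence with its mean graded correctness (assumed in [0, 1])."""
    if len(confidence) != len(correctness) or not confidence:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for c, s in zip(confidence, correctness):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, s))
    total = len(confidence)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        mean_corr = sum(s for _, s in members) / len(members)
        error += (len(members) / total) * abs(mean_conf - mean_corr)
    return error
```

With a binary correctness score (0 or 1) the binned version reduces to the familiar expected calibration error; allowing partial scores is what lets it apply to partially correct long-form answers.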

What problem does this paper attempt to address?

This paper addresses the calibration of large language models (LLMs) in long-form generation. Traditional calibration methods typically rely on a binary true/false judgment of correctness, which does not suit long-form generation because a long answer is often partially correct rather than entirely right or entirely wrong. The paper therefore proposes a unified calibration framework that treats both the correctness of a model's answers and the associated confidence levels as distributions over a range of scores.

The main contributions of the paper include:

1. **Proposing a general calibration framework**: The framework applies to text generation tasks, both long-form and short-form, and assesses calibration by representing correctness and confidence as distributions.
2. **Introducing new confidence elicitation methods and calibration metrics**: Three calibration metrics and two confidence elicitation methods (based on self-consistency and self-evaluation) are applied to multiple LLMs to evaluate and improve their calibration.
3. **Providing evidence that fine-tuning the model and adjusting the temperature can improve calibration**.
4. **Showing a practical application case**: A selective answering strategy improves the cost-effectiveness of long-form generation. An open-source model handles the initial query, and its confidence level determines whether a more advanced API model needs to be involved, keeping costs down while maintaining high performance.

The paper validates the proposed framework and methods through experiments on long-form QA and summarization tasks. The study also finds that larger models do not necessarily have better calibration, and that the different calibration metrics are complementary, together providing a more comprehensive evaluation of a model.
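The selective answering strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes the API budget is a fixed number of calls and that the lowest-confidence queries are escalated first. The names `select_for_escalation`, `cascade`, and `api_model` are hypothetical.

```python
from typing import Callable, List, Sequence


def select_for_escalation(confidences: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` lowest-confidence queries, in ascending order.

    Assumption: escalating the least confident answers first is a reasonable
    way to spend a fixed number of API calls.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    by_confidence = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return sorted(by_confidence[:budget])


def cascade(queries: Sequence[str],
            local_answers: Sequence[str],
            confidences: Sequence[float],
            api_model: Callable[[str], str],
            budget: int) -> List[str]:
    """Keep the open-source model's answer unless the query was escalated."""
    escalated = set(select_for_escalation(confidences, budget))
    return [
        api_model(query) if i in escalated else answer
        for i, (query, answer) in enumerate(zip(queries, local_answers))
    ]


if __name__ == "__main__":
    # Stand-in for a call to a stronger API model (hypothetical).
    def fake_api(query: str) -> str:
        return "API answer to: " + query

    answers = cascade(
        queries=["q0", "q1", "q2", "q3"],
        local_answers=["a0", "a1", "a2", "a3"],
        confidences=[0.9, 0.2, 0.6, 0.4],
        api_model=fake_api,
        budget=2,
    )
    print(answers)
```

How well such a cascade works depends on calibration: if the open-source model's confidence does not track its actual correctness, the budget is spent on the wrong queries.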

Calibrating Long-form Generations from Large Language Models

Linguistic Calibration of Long-Form Generations

Atomic Calibration of LLMs in Long-Form Generations

Calibrating Large Language Models with Sample Consistency

Calibrating Large Language Models Using Their Generations Only

Calibrating LLM-Based Evaluator

The Calibration Gap between Model and Human Confidence in Large Language Models

Graph-based Confidence Calibration for Large Language Models

Calibrating the Confidence of Large Language Models by Eliciting Fidelity

On the Calibration of Large Language Models and Alignment

LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses

A Survey of Confidence Estimation and Calibration in Large Language Models

Multicalibration for Confidence Scoring in LLMs

Generative Calibration for In-context Learning

Calibration and Correctness of Language Models for Code

Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation

Calibrated Large Language Models for Binary Question Answering

Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration

Self-Evaluation Improves Selective Generation in Large Language Models

Fact-Level Confidence Calibration and Self-Correction