Abstract: Accurate uncertainty quantification is crucial for the safe deployment of machine learning models, and prior research has demonstrated improvements in the calibration of modern language models (LMs). We study in-context learning (ICL), a prevalent method for adapting static LMs through tailored prompts, and examine the balance between performance and calibration across a broad spectrum of natural language understanding and reasoning tasks. Through comprehensive experiments, we observe that, with an increasing number of ICL examples, models initially exhibit increased miscalibration before achieving better calibration, and miscalibration tends to arise in low-shot settings. Moreover, we find that methods aimed at improving usability, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations. Furthermore, we explore recalibration techniques and find that a scaling-binning calibrator can reduce calibration errors consistently.

What problem does this paper attempt to address?

This paper studies the **calibration of language models (LMs) under in-context learning (ICL)**. Specifically, the authors examine the trade-off between task performance and calibration when a model is adapted through ICL. The main research questions are:

1. **Accurate uncertainty quantification**:
   - Accurate uncertainty quantification is crucial for the safe deployment of machine learning models. Although modern language models perform well, they sometimes produce incorrect or overconfident outputs, which is a sign of poor calibration.
2. **Calibration-performance balance in ICL**:
   - Across extensive experiments, the authors observe that as the number of ICL examples increases, models first become more miscalibrated and only later become better calibrated. Miscalibration is most pronounced in low-shot settings.
3. **The impact of usability-oriented methods on calibration**:
   - The paper finds that methods aimed at improving usability, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations.
4. **The application of recalibration techniques**:
   - The authors explore several recalibration techniques and find that a scaling-binning calibrator consistently reduces calibration error.
5. **Calibration problems in reasoning tasks**:
   - In reasoning tasks that involve generating explanations, the model may produce confident but incorrect answers. The authors illustrate this with reliability diagrams and confidence histograms.
6. **The effects of different prompting strategies**:
   - The study also examines how different prompting strategies (such as repeating the context or repeating the prompt) affect performance and calibration, and finds that prompts containing labels can significantly reduce uncertainty and improve learning performance.

### Summary

The core problem of this paper is **how well calibrated language models are under in-context learning**, across task types (such as text classification and reasoning) and across different numbers of in-context examples. Through analysis and experiments, the authors characterize the calibration challenges of ICL and evaluate recalibration approaches, notably the scaling-binning calibrator, with the aim of making models more reliable and safer in practical applications.

### Key formulas

The key formulas used in the paper are as follows:

- **Classical calibration definition**:

  \[ P(Y = y \mid P_\theta(X) = p) = p_y \]

  where \( P_\theta(X) \) is the predicted probability distribution of the model with parameters \( \theta \), \( p \) is a value of that distribution, \( p_y \) is the probability it assigns to label \( y \), and \( y \) is the true label.

- **Confidence calibration definition**:

  \[ P(Y = c(X) \mid \max P_\theta(X) = p^*) = p^* \]

  where \( c(X) = \arg\max P_\theta(X) \) is the class with the highest predicted probability and \( p^* \) is that highest probability, that is, the model's confidence.

- **Expected calibration error (ECE)**:

  \[ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| \]

  where \( B_m \) is the set of samples whose confidence falls into the \( m \)-th of \( M \) confidence bins, \( |B_m| \) is the number of samples in that bin, \( n \) is the total number of samples, \( \text{acc}(B_m) \) is the accuracy within the bin, and \( \text{conf}(B_m) \) is the average confidence within the bin.

These formulas are used to quantify and evaluate how well a language model is calibrated.
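To make the ECE formula concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `expected_calibration_error`, the default of 10 equal-width bins, and the toy data are assumptions made for illustration only.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    num_bins: int = 10,
) -> float:
    """Return ECE = sum_m (|B_m| / n) * |acc(B_m) - conf(B_m)|.

    confidences: top predicted probability per sample, each in [0, 1].
    correct: whether the top prediction matched the true label.
    num_bins: M, the number of equal-width bins (10 is an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one sample is required")
    if num_bins < 1:
        raise ValueError("num_bins must be positive")

    counts = [0] * num_bins        # |B_m|
    conf_sums = [0.0] * num_bins   # sum of confidences in B_m
    hits = [0] * num_bins          # number of correct predictions in B_m
    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # Bin m holds confidences in [m/M, (m+1)/M); 1.0 goes into the last bin.
        m = min(int(conf * num_bins), num_bins - 1)
        counts[m] += 1
        conf_sums[m] += conf
        hits[m] += 1 if is_correct else 0

    ece = 0.0
    for count, conf_sum, hit in zip(counts, conf_sums, hits):
        if count == 0:
            continue  # empty bins contribute nothing to the sum
        accuracy = hit / count
        avg_confidence = conf_sum / count
        ece += (count / n) * abs(accuracy - avg_confidence)
    return ece


if __name__ == "__main__":
    # Hypothetical toy data: four overconfident predictions, two near chance.
    confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
    outcomes = [True, True, False, False, True, False]
    print(round(expected_calibration_error(confs, outcomes), 3))  # 0.317
```

In the toy data, the high-confidence bin has average confidence 0.95 but accuracy 0.5, so it dominates the error: (4/6) × 0.45 + (2/6) × 0.05 ≈ 0.317. This is the "confident but incorrect" pattern that a reliability diagram would show as a bar well below the diagonal.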

A Study on the Calibration of In-context Learning
