A Study on the Calibration of In-context Learning

Hanlin Zhang, Yi-Fan Zhang, Yaodong Yu, Dhruv Madeka, Dean Foster, Eric Xing, Himabindu Lakkaraju, Sham Kakade
2024-03-28
Abstract: Accurate uncertainty quantification is crucial for the safe deployment of machine learning models, and prior research has demonstrated improvements in the calibration of modern language models (LMs). We study in-context learning (ICL), a prevalent method for adapting static LMs through tailored prompts, and examine the balance between performance and calibration across a broad spectrum of natural language understanding and reasoning tasks. Through comprehensive experiments, we observe that, with an increasing number of ICL examples, models initially exhibit increased miscalibration before achieving better calibration, and miscalibration tends to arise in low-shot settings. Moreover, we find that methods aimed at improving usability, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations. Furthermore, we explore recalibration techniques and find that a scaling-binning calibrator can reduce calibration errors consistently.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper studies the **calibration of language models (LMs) under in-context learning (ICL)**. Specifically, the authors examine the trade-off between task performance and calibration when models are adapted through ICL. The main research questions are:

1. **Accurate uncertainty quantification**: Accurate uncertainty quantification is crucial for the safe deployment of machine learning models. Although modern language models perform well, they sometimes produce incorrect or overconfident outputs, which means they are poorly calibrated.
2. **Calibration-performance balance in ICL**: Across extensive experiments, the authors observe that as the number of ICL examples increases, the model first becomes more miscalibrated and then gradually becomes better calibrated. Miscalibration is most pronounced in low-shot settings.
3. **The impact of usability-oriented methods on calibration**: Methods intended to make the model more usable, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations.
4. **The application of recalibration techniques**: The authors explore several recalibration techniques and find that a scaling-binning calibrator reduces calibration error consistently.
5. **Calibration problems in reasoning tasks**: In reasoning tasks that involve generating explanations, the model may produce confident but incorrect answers. The authors illustrate this with reliability diagrams and confidence histograms.
6. **The effects of different prompting strategies**: The study also compares prompting strategies (such as repeating the context or repeating the prompt) in terms of performance and calibration, and finds that prompts containing labels can significantly reduce uncertainty and improve learning performance.

### Summary

The core problem of this paper is **how to achieve accurate calibration of language models in in-context learning**, across task types (such as text classification and reasoning) and across different numbers of in-context examples. Through analysis and experiments, the authors characterize the calibration challenges that arise in ICL and explore recalibration as a remedy, with the aim of making models more reliable and safer in practical applications.

### Formula presentation

Some of the key formulas involved in the paper are as follows:

- **Classical calibration definition**:
  \[ P(Y = y \mid P_\theta(X) = p) = p_y \]
  where \( P_\theta(X) \) is the predicted probability distribution for input \( X \) under model parameters \( \theta \), \( p \) is a value of that distribution, \( Y \) is the true label, and \( p_y \) is the probability that \( p \) assigns to label \( y \).
- **Confidence calibration definition**:
  \[ P(Y = c(X) \mid \max P_\theta(X) = p^*) = p^* \]
  where \( c(X) = \arg\max P_\theta(X) \) is the class with the highest predicted probability and \( p^* \) is that probability, that is, the model's confidence.
- **Expected calibration error (ECE)**:
  \[ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| \]
  where \( B_m \) is the set of samples whose confidence falls in the \( m \)-th of \( M \) confidence bins, \( n \) is the total number of samples, \( \text{acc}(B_m) \) is the accuracy within the bin, and \( \text{conf}(B_m) \) is the average confidence within the bin.

These formulas help researchers quantify and evaluate the calibration of language models.
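The ECE formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code: the function name `expected_calibration_error`, the choice of `n_bins=10`, and the use of equal-width bins with right-closed intervals are all assumptions made here for the example. Confidence is taken as the maximum predicted probability, matching the confidence calibration definition.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Estimate ECE with equal-width confidence bins (illustrative sketch).

    probs:  array of shape (n, K), one predicted distribution per sample.
    labels: array of shape (n,), true class indices.
    n_bins: number of bins M (10 is an assumption, not taken from the paper).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)

    confidences = probs.max(axis=1)      # p* = max P_theta(X)
    predictions = probs.argmax(axis=1)   # c(X) = argmax P_theta(X)
    correct = (predictions == labels).astype(float)

    n = len(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bin B_m holds samples with confidence in (lo, hi].
        # The maximum of a probability vector is always > 0,
        # so no sample is lost at the left edge of the first bin.
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()       # acc(B_m)
        conf = confidences[in_bin].mean()  # conf(B_m)
        ece += in_bin.sum() / n * abs(acc - conf)
    return float(ece)


if __name__ == "__main__":
    # Hypothetical toy predictions for a binary task.
    probs = [[0.95, 0.05], [0.85, 0.15], [0.65, 0.35], [0.25, 0.75]]
    labels = [0, 1, 0, 1]
    print(expected_calibration_error(probs, labels))
```

In the toy example each sample falls into its own bin, so the result is the average gap between correctness (1 or 0) and confidence, roughly 0.375. Other binning schemes, such as equal-mass bins, give different estimates on the same predictions, so reported ECE values are only comparable when the binning is the same.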