Calibrating Verbalized Probabilities for Large Language Models

Cheng Wang, Gyuri Szarvas, Georges Balazs, Pavel Danchenko, Patrick Ernst
2024-10-09
Abstract: Calibrating verbalized probabilities presents a novel approach for reliably assessing and leveraging outputs from black-box Large Language Models (LLMs). Recent methods have demonstrated improved calibration by applying techniques like Platt scaling or temperature scaling to the confidence scores generated by LLMs. In this paper, we explore the calibration of verbalized probability distributions for discriminative tasks. First, we investigate the capability of LLMs to generate probability distributions over categorical labels. We theoretically and empirically identify the issue of re-softmax arising from the scaling of verbalized probabilities, and propose using the invert softmax trick to approximate the "logit" by inverting verbalized probabilities. Through extensive evaluation on three public datasets, we demonstrate: (1) the robust capability of LLMs in generating class distributions, and (2) the effectiveness of the invert softmax trick in estimating logits, which, in turn, facilitates post-calibration adjustments.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is the calibration of probability distributions verbalized by large language models (LLMs). Specifically, the authors focus on how to reliably evaluate and use the probability distributions generated by black-box LLMs (such as Claude and Mistral), especially in risk-sensitive fields such as medicine and finance, where decisions require high confidence.

### Main problems

1. **Ability to generate probability distributions**:
   - Can LLMs generate complete class probability distributions for discriminative tasks, rather than only a predicted label and its confidence?
   - The authors design specific prompt templates that ask LLMs for a probability distribution and verify this capability empirically.
2. **Impact of temperature scaling (TS) on generated probabilities**:
   - In post-hoc calibration, temperature scaling is normally applied to unnormalized logits, not to probabilities that have already passed through softmax.
   - Applying temperature scaling directly to the probabilities generated by an LLM applies softmax a second time (re-softmaxing), which makes the calibration results unreliable.
3. **Application of the invert softmax trick**:
   - To avoid re-softmaxing, the authors propose the invert softmax trick, which converts the generated probabilities into estimated logits before temperature scaling is applied.

### Solutions

- **Generating probability distributions**: Carefully designed prompt templates ask LLMs for a probability distribution over all labels instead of a single predicted label and confidence.
- **Identifying and solving the re-softmaxing problem**: Theoretical and empirical analyses show that applying temperature scaling directly to generated probabilities calibrates poorly. The authors therefore use the invert softmax trick to estimate logits and then apply temperature scaling to those estimates.
- **Experimental verification**: Experiments on three public datasets (IMDB, Emotion, Amazon Massive) verify the effectiveness of these methods.

### Formula presentation

- **Temperature scaling formula**:
  \[ p_i=\frac{\exp(z_i / \tau)}{\sum_{j = 1}^{K}\exp(z_j / \tau)}, \quad i\in[1,\dots,K] \]
  where \(\tau>0\) is the temperature parameter that rescales the unnormalized activation values (logits) \(z_i\), and \(K\) is the number of classes.
- **Invert softmax trick**:
  \[ z_i = \text{INV\_SOFTMAX}(p_i)=\log p_i + c \]
  where \(c = -\frac{1}{K}\sum_{j = 1}^{K}\log p_j\) is a constant scalar.

### Experimental results

The experimental results show that combining the invert softmax trick with temperature scaling significantly improves the calibration of LLM-generated probability distributions on the IMDB, Emotion, and Amazon Massive datasets.

In conclusion, this paper addresses the calibration of LLM-generated probability distributions by introducing the invert softmax trick, offering a more reliable way to apply black-box LLMs in risk-sensitive fields.
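
The re-softmaxing issue and the invert softmax trick described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `softmax` and `inv_softmax`, the `eps` clipping guard, and the three-label distribution are assumptions chosen for illustration only. The sketch compares feeding verbalized probabilities straight into a temperature-scaled softmax with first estimating logits via \(\log p_i + c\).

```python
import math

def softmax(z, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [v / tau for v in z]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def inv_softmax(p, eps=1e-12):
    """Estimate logits as z_i = log p_i + c with c = -mean(log p).

    eps is an assumed guard against log(0); it is not part of the formula.
    """
    logs = [math.log(max(v, eps)) for v in p]
    c = -sum(logs) / len(logs)
    return [v + c for v in logs]

# Hypothetical verbalized distribution over three labels (illustrative values).
verbalized = [0.90, 0.07, 0.03]

# Re-softmax: treating probabilities as if they were logits flattens the
# distribution even at tau = 1.
re_softmaxed = softmax(verbalized, tau=1.0)

# Invert softmax trick: estimate logits first, then apply temperature scaling.
logits = inv_softmax(verbalized)
recovered = softmax(logits, tau=1.0)  # tau = 1 reproduces the verbalized input
softened = softmax(logits, tau=2.0)   # tau > 1 softens the distribution

print("re-softmaxed:", re_softmaxed)
print("recovered:   ", recovered)
print("softened:    ", softened)
```

With estimated logits, \(\tau = 1\) leaves the verbalized distribution unchanged, so the temperature acts as a pure calibration adjustment. In standard temperature scaling, \(\tau\) would be fitted on held-out labeled data, typically by minimizing negative log-likelihood; the fixed value of 2.0 here is only for demonstration.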