Abstract: Calibrating verbalized probabilities presents a novel approach for reliably assessing and leveraging outputs from black-box Large Language Models (LLMs). Recent methods have demonstrated improved calibration by applying techniques like Platt scaling or temperature scaling to the confidence scores generated by LLMs. In this paper, we explore the calibration of verbalized probability distributions for discriminative tasks. First, we investigate the capability of LLMs to generate probability distributions over categorical labels. We theoretically and empirically identify the issue of re-softmax arising from the scaling of verbalized probabilities, and propose using the invert softmax trick to approximate the "logit" by inverting verbalized probabilities. Through extensive evaluation on three public datasets, we demonstrate: (1) the robust capability of LLMs in generating class distributions, and (2) the effectiveness of the invert softmax trick in estimating logits, which, in turn, facilitates post-calibration adjustments.

What problem does this paper attempt to address?

The problem this paper addresses is the calibration of probability distributions verbalized by large language models (LLMs). Specifically, the authors focus on how to reliably evaluate and use the probability distributions generated by black-box LLMs (such as Claude and Mistral), especially in risk-sensitive fields such as medicine and finance, where decisions require high confidence.

### Main problems

1. **Ability to generate probability distributions**:
   - Can LLMs generate complete class probability distributions for discriminative tasks, rather than only a predicted label and its confidence?
   - The authors design specific prompt templates that ask LLMs for a probability distribution, and verify how well the models can produce one.
2. **Impact of Temperature Scaling (TS) on generated probabilities**:
   - In post-processing calibration methods, temperature scaling is usually applied to unnormalized logits rather than directly to the probabilities after softmax.
   - If temperature scaling is applied directly to the probabilities generated by LLMs, the softmax function is applied a second time (re-softmaxing), which makes the calibration results unreliable.
3. **Application of the Inverse Softmax Trick**:
   - To avoid the re-softmaxing problem, the authors propose an Inverse Softmax Trick, which converts the generated probabilities into estimated logits before temperature scaling is performed.

### Solutions

- **Generating probability distributions**: Carefully designed prompt templates require LLMs to generate probability distributions instead of simple predicted labels and confidences.
- **Identifying and solving the re-softmaxing problem**: Theoretical and empirical analyses show that applying temperature scaling directly to generated probabilities yields unsatisfactory calibration. Therefore, the authors propose the Inverse Softmax Trick to estimate logits first and then perform temperature scaling.
- **Experimental verification**: Extensive experiments on three public datasets (IMDB, Emotion, Amazon Massive) verify the effectiveness of these methods.

### Formula presentation

- **Temperature scaling formula**:
  \[ p_i = \frac{\exp(z_i / \tau)}{\sum_{j=1}^{k} \exp(z_j / \tau)}, \quad i \in \{1, \dots, k\} \]
  where \(\tau > 0\) is the temperature parameter used to rescale the unnormalized activation values (logits) \(z_i\), and \(k\) is the number of classes.
- **Inverse Softmax Trick**:
  \[ z_i = \text{INV\_SOFTMAX}(p_i) = \log p_i + c \]
  where \(c = -\frac{1}{k}\sum_{j=1}^{k} \log p_j\) is a constant scalar shared by all classes. Because softmax is invariant to adding the same constant to every logit, applying softmax to these estimated logits with \(\tau = 1\) recovers the original probabilities \(p_i\).

### Experimental results

The experimental results show that combining the Inverse Softmax Trick with temperature scaling significantly improves the calibration of LLM-generated probability distributions on the IMDB, Emotion, and Amazon Massive datasets.

In conclusion, this paper addresses the calibration of LLM-generated probability distributions by introducing the Inverse Softmax Trick, offering a more reliable way to apply black-box LLMs in risk-sensitive fields.
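The re-softmaxing problem and the Inverse Softmax Trick described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`softmax`, `inv_softmax`, `nll`, `fit_temperature`), the example distribution, and the use of a coarse grid search over \(\tau\) (instead of gradient-based fitting) are all assumptions made here for illustration. The sketch only relies on the two formulas given in the summary and on the standard practice of choosing the temperature by minimizing negative log-likelihood on held-out labeled data.

```python
import math

def softmax(z, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [v / tau for v in z]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def inv_softmax(p, eps=1e-12):
    """Estimate logits from probabilities: z_i = log p_i + c, c = -mean(log p)."""
    logs = [math.log(max(v, eps)) for v in p]  # eps guards against log(0)
    c = -sum(logs) / len(logs)
    return [v + c for v in logs]

def nll(probs_list, labels, tau):
    """Mean negative log-likelihood after inverse softmax + temperature scaling."""
    total = 0.0
    for p, y in zip(probs_list, labels):
        q = softmax(inv_softmax(p), tau)
        total -= math.log(q[y])
    return total / len(labels)

def fit_temperature(probs_list, labels, grid=None):
    """Pick tau from a coarse grid by minimizing NLL (illustrative assumption)."""
    grid = grid or [0.1 * n for n in range(2, 51)]  # roughly 0.2 .. 5.0
    return min(grid, key=lambda t: nll(probs_list, labels, t))

# Hypothetical verbalized distribution over three classes.
p = [0.7, 0.2, 0.1]

# Re-softmax: feeding probabilities in as if they were logits
# squashes the distribution toward uniform even at tau = 1.
resoftmaxed = softmax(p, tau=1.0)

# Inverse softmax first: tau = 1 recovers the original distribution.
recovered = softmax(inv_softmax(p), tau=1.0)

# tau > 1 softens the recovered distribution; tau < 1 would sharpen it.
softened = softmax(inv_softmax(p), tau=2.0)

# Hypothetical validation set: the model says 0.9 but is right 3 times out of 4,
# so the fitted temperature should be above 1 (the model is overconfident).
val_probs = [[0.9, 0.1]] * 4
val_labels = [0, 0, 0, 1]
tau_hat = fit_temperature(val_probs, val_labels)
```

In this sketch, `resoftmaxed` no longer resembles `p`, while `recovered` matches it up to floating-point error, which is the motivation for estimating logits before scaling. A real pipeline would fit \(\tau\) on a labeled validation split and then apply `softmax(inv_softmax(p), tau_hat)` to new verbalized distributions.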

Calibrating Verbalized Probabilities for Large Language Models
