Abstract: Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities across a variety of domains. However, gauging the trustworthiness of responses generated by LLMs remains an open challenge, with limited research on uncertainty quantification (UQ) for NLG. Furthermore, existing literature typically assumes white-box access to language models, which is becoming unrealistic either due to the closed-source nature of the latest LLMs or computational constraints. In this work, we investigate UQ in NLG for *black-box* LLMs. We first differentiate *uncertainty* vs. *confidence*: the former refers to the "dispersion" of the potential predictions for a fixed input, and the latter refers to the confidence on a particular prediction/generation. We then propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment. Experiments were carried out with several popular LLMs on question-answering datasets (for evaluation purposes). Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses, providing valuable insights for practitioners on uncertainty management when adopting LLMs. The code to replicate our experiments is available at

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses uncertainty quantification (UQ) in natural language generation (NLG) tasks for large language models (LLMs). Specifically:

1. **Confidence Assessment**: Measuring the trustworthiness of responses generated by LLMs remains an open challenge, and research on uncertainty quantification for NLG is limited.
2. **Black-box Models**: Existing literature often assumes white-box access to language models. That assumption is becoming unrealistic because the latest LLMs are closed-source or too computationally expensive to run locally. The paper therefore focuses on uncertainty quantification for black-box LLMs.

### Main Contributions

1. **Exploring Uncertainty Quantification in Black-box LLMs**: The paper studies uncertainty quantification for black-box LLMs and evaluates its value in selective natural language generation.
2. **Proposing Simple and Effective Uncertainty Estimation Methods**: The paper proposes and compares several simple uncertainty and confidence measures for selective NLG, where unreliable results are either ignored or passed on for further assessment.
3. **Experimental Validation**: Through experiments on several popular LLMs and question-answering datasets, the paper finds that a simple measure of semantic dispersion can reliably predict the quality of LLM responses, providing valuable insights for practitioners.

### Background and Related Work

- **Uncertainty Quantification**: In machine learning, uncertainty quantification is an important research area. Reliable uncertainty measures are crucial for deciding when to trust a model.
- **Selective Classification**: By analogy with selective classification, selective NLG can reject generations whose estimated uncertainty is high, which is particularly important in high-risk applications such as healthcare or law.
- **Existing Challenges**: Uncertainty quantification in NLG faces specific challenges, such as the high dimensionality of the output space and the fact that different word sequences may convey the same meaning.

### Methodology

1. **Generate Multiple Responses**: For a given input, generate multiple response samples.
2. **Calculate Similarity**: Calculate pairwise similarity scores between these responses.
3. **Compute Uncertainty Estimates**: Use the similarity values to compute uncertainty estimates or confidence scores.

### Experiments and Results

- **Datasets**: Experiments were conducted using datasets such as CoQA, TriviaQA, and Natural Questions.
- **Models**: Models tested include OPT, LLaMA, LLaMA2, and OpenAI's gpt-3.5-turbo.
- **Measures Compared**: The black-box measures NumSet, Deg, Ecc, and EigV were evaluated and compared with existing white-box baseline methods.
- **Evaluation Metrics**: Metrics such as AUROC and AUARC were used to evaluate the quality of uncertainty quantification.

### Conclusion

Through extensive experiments, the paper validates the effectiveness of the proposed uncertainty estimation methods, particularly in identifying challenging questions and predicting the quality of the corresponding answers. These methods give practitioners a practical reference for managing uncertainty when adopting LLMs.
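The three methodology steps summarized above (sample several responses, score pairwise similarity, aggregate into uncertainty and confidence) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the responses have already been sampled from the model, uses token-level Jaccard overlap as a stand-in similarity, and the function names (`jaccard`, `similarity_matrix`, `degree_scores`, `eigenvalue_uncertainty`) are hypothetical. The two aggregations are written in the spirit of degree-based and graph-Laplacian-eigenvalue measures such as Deg and EigV, and their exact form here is an assumption.

```python
import numpy as np


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses (a stand-in similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_matrix(responses: list[str]) -> np.ndarray:
    """Symmetric m x m matrix of pairwise similarities with a unit diagonal."""
    m = len(responses)
    w = np.ones((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            w[i, j] = w[j, i] = jaccard(responses[i], responses[j])
    return w


def degree_scores(w: np.ndarray) -> tuple[float, np.ndarray]:
    """Degree-based scores (assumed form).

    Returns one uncertainty value for the input and one confidence value
    per response: responses that agree with many others get high confidence.
    """
    m = w.shape[0]
    degree = w.sum(axis=1)
    uncertainty = float((m - degree).sum() / m**2)
    confidence = degree / m
    return uncertainty, confidence


def eigenvalue_uncertainty(w: np.ndarray) -> float:
    """Sum of max(0, 1 - eigenvalue) over the normalized graph Laplacian.

    Behaves like a soft count of distinct meanings among the responses.
    """
    degree = w.sum(axis=1)  # at least 1 because the diagonal of w is 1
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(laplacian)
    return float(np.clip(1.0 - eigvals, 0.0, None).sum())


if __name__ == "__main__":
    # Hypothetical samples for one question; in practice these come from the LLM.
    samples = ["Paris", "paris", "It is Paris", "Lyon"]
    w = similarity_matrix(samples)
    u_deg, conf = degree_scores(w)
    print("degree uncertainty:", u_deg)
    print("per-response confidence:", conf)
    print("eigenvalue uncertainty:", eigenvalue_uncertainty(w))
```

For selective NLG, the per-response confidence could be thresholded so that low-confidence generations are withheld or sent for further assessment. Token overlap ignores paraphrases, so a similarity based on semantic equivalence would be the natural replacement for `jaccard`.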

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models
