Abstract: Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating the confidence of Flan-ul2, Llama-13b and Mistral-7b on four benchmark Q&A tasks, as well as of Pegasus-large and BART-large on two benchmark summarization tasks, surpassing baselines by over $10\%$ (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issue of confidence estimation in the responses of large language models (LLMs). Specifically, the authors explore methods to estimate the confidence of these models' responses through black-box access or queries. The paper proposes a simple and extensible framework that estimates confidence by designing new features and training an interpretable model (such as a logistic regression model).

### Main Contributions

1. **Proposed a New Framework**: The framework estimates the confidence of large language models by designing new features and training an interpretable model.
2. **Effective Feature Engineering**: The authors proposed 6 different input prompt perturbation strategies that can generate features for confidence estimation.
3. **Extensive Experimental Validation**: Experiments were conducted on multiple benchmark tasks, including 4 question-answering tasks and 2 summarization tasks, demonstrating the effectiveness of the method.
4. **Cross-Model Generalization Ability**: The study found that confidence models built for one LLM can zero-shot generalize to other LLMs, providing the possibility of constructing universal confidence models.

### Method Overview

1. **Prompt Perturbation Strategies**:
   - **Stochastic Decoding (SD)**: Generate multiple outputs using different decoding strategies (e.g., greedy decoding, beam search, and nucleus sampling).
   - **Paraphrasing (PP)**: Paraphrase the context in the prompt.
   - **Sentence Permutation (SP)**: Change the order of named entities in the prompt.
   - **Entity Frequency Amplification (EFA)**: Repeat sentences containing named entities.
   - **Stopword Removal (SR)**: Remove stopwords from the context.
   - **Response Consistency Check (SRC)**: Randomly split the model's output into two parts and check the semantic consistency between them.
2. **Feature Construction**:
   - **Semantic Sets**: Create semantic equivalence sets based on the semantic similarity of the outputs.
   - **Lexical Similarity**: Calculate the lexical similarity between outputs.
   - **SRC Minimum**: Use the contradiction probability of a natural language inference (NLI) model to measure semantic inconsistency between response parts.
3. **Label Creation and Confidence Estimation**:
   - Create labels by matching the LLM's output with the ground truth responses in the dataset.
   - Train and predict confidence scores using a logistic regression model.

### Experimental Results

- The method significantly outperformed baseline methods on AUROC and AUARC metrics across multiple benchmark tasks.
- The performance improvement was particularly notable on the TriviaQA and SQuAD datasets.
- Confidence models built for one LLM can zero-shot generalize to other LLMs, showing good cross-model generalization ability.

### Conclusion

This paper proposes an effective method for estimating the confidence of large language models through black-box access. The method performs well across multiple benchmark tasks and exhibits good cross-model generalization ability. This research provides new insights for improving the trustworthiness and reliability of large language models.
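The feature-construction and confidence-estimation steps in the method overview can be illustrated with a minimal Python sketch. It is not the authors' implementation: token-level Jaccard overlap stands in for the lexical similarity metric, a Jaccard threshold stands in for the semantic-equivalence check when forming semantic sets, the SRC feature is omitted because it needs an NLI model, and the toy responses, labels, and all function names are hypothetical. The logistic regression is fitted with plain gradient descent so the example needs only the standard library.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard overlap (assumed stand-in for the similarity metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def lexical_similarity(responses):
    """Mean pairwise similarity over all responses to the perturbed prompts."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def count_semantic_sets(responses, threshold=0.5):
    """Greedy grouping: a response joins the first set whose representative
    it resembles; otherwise it starts a new set. The threshold is an assumption."""
    representatives = []
    for r in responses:
        if not any(jaccard(r, rep) >= threshold for rep in representatives):
            representatives.append(r)
    return len(representatives)


def features(responses):
    return [lexical_similarity(responses), float(count_semantic_sets(responses))]


def sigmoid(z):
    # Numerically stable for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def confidence(w, b, x):
    """Predicted probability that the response is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, lr=0.1, epochs=5000):
    """Batch gradient descent on the mean logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = confidence(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Fraction of (correct, incorrect) pairs ranked in the right order."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data: each entry holds responses to perturbed versions of one prompt,
# and the label says whether the original response matched the ground truth.
toy_responses = [
    ["Paris", "Paris", "paris", "Paris"],
    ["the capital is Paris", "Paris is the capital", "the capital is Paris"],
    ["Lyon", "Paris", "Marseille", "Nice"],
    ["in 1912", "in 1915", "around 1920", "1912"],
]
toy_labels = [1, 1, 0, 0]

X = [features(r) for r in toy_responses]
w, b = train_logreg(X, toy_labels)
scores = [confidence(w, b, x) for x in X]
print("confidence scores:", [round(s, 3) for s in scores])
print("AUROC on toy data:", auroc(scores, toy_labels))
```

Because the fitted model is linear in two named features, the learned weights can be read directly, which is the kind of interpretability the summary attributes to the approach. In a real setting the features would be computed from actual LLM outputs and the model evaluated on held-out data.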

Large Language Model Confidence Estimation via Black-Box Access
