Abstract: Large Language Models (LLMs) tend to be unreliable in the factuality of their answers. To address this problem, NLP researchers have proposed a range of techniques to estimate LLMs' confidence over facts. However, due to the lack of a systematic comparison, it is not clear how the different methods compare to one another. To fill this gap, we present a survey and empirical comparison of estimators of factual confidence. We define an experimental framework allowing for fair comparison, covering both fact-verification and question answering. Our experiments across a series of LLMs indicate that trained hidden-state probes provide the most reliable confidence estimates, albeit at the expense of requiring access to weights and training data. We also conduct a deeper assessment of factual confidence by measuring the consistency of model behavior under meaning-preserving variations in the input. We find that the confidence of LLMs is often unstable across semantically equivalent inputs, suggesting that there is much room for improvement of the stability of models' parametric knowledge. Our code is available at https://github.com/amazon-science/factual-confidence-of-llms.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the unreliable factuality of text generated by large language models (LLMs). It focuses on two aspects:

1. **Factual reliability**: LLMs sometimes generate incorrect information or state uncertain facts with confidence. This behavior can spread misleading information and damage users' trust.
2. **Comparison of factual-confidence estimation methods**: Although various techniques have been proposed to estimate the factual confidence of LLMs, they have not been compared systematically, so it is unclear which methods are more reliable and robust.

### Solutions

To address these problems, the paper does the following:

1. **Define an experimental framework**: The paper proposes an experimental framework that allows a fair comparison of factual-confidence estimation methods across models and datasets. The framework covers two tasks: fact verification and question answering.
2. **Classification and evaluation of methods**: The paper classifies existing factual-confidence estimation methods into five categories: trained probes, sequence probability, verbalization, surrogate token probability, and consistency. It evaluates these methods experimentally on multiple publicly available LLMs.
3. **Impact of input variations**: The paper also examines how consistently LLMs behave under semantically equivalent inputs. It finds that the confidence of LLMs is often unstable across such inputs, indicating that there is still much room for improvement in the stability of models' parametric knowledge.

### Main contributions

1. **Literature review**: The paper reviews current methods for estimating the factual confidence of LLMs.
2. **Experimental framework**: It provides an experimental framework that makes the comparison between methods fairer and more systematic.
3. **Method evaluation**: It provides experimental insights into the reliability and robustness of the different methods and offers recommendations for natural language processing (NLP) practitioners.

### Conclusions

Through systematic experiments and analysis, the paper finds that trained probes are the most reliable method for estimating factual confidence, but they require access to model weights and training data. For instruction-tuned LLMs, verbalization and consistency-based methods are also viable options. The paper also emphasizes the importance of testing a model under semantically equivalent inputs to check that its encoding of factual knowledge remains stable across input variations.
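To make three of the ideas summarized above more concrete (sequence probability, consistency, and stability under semantically equivalent inputs), here is a minimal Python sketch. It is not taken from the paper or its repository: the function names are hypothetical, the length normalization in `sequence_probability` is an assumed design choice, and the inputs (per-token log-probabilities, sampled answers, per-paraphrase confidence scores) are assumed to have been obtained from a model beforehand.

```python
import math
from collections import Counter
from statistics import pstdev


def sequence_probability(token_logprobs):
    """Length-normalized probability of a statement.

    Assumption: confidence is the geometric mean of the token
    probabilities, i.e. exp(mean of the per-token log-probabilities).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency_confidence(sampled_answers):
    """Return the most frequent answer and the share of samples that match it.

    Assumption: answers are compared after trimming whitespace and
    lowercasing; a real setup may need fuzzier matching.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    normalized = [answer.strip().lower() for answer in sampled_answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return top_answer, count / len(normalized)


def paraphrase_spread(confidences):
    """Population standard deviation of confidence scores that one
    estimator assigns to several paraphrases of the same fact.
    A larger value means less stable confidence."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    return pstdev(confidences)


if __name__ == "__main__":
    # Made-up inputs, purely for illustration.
    print(sequence_probability([-0.1, -0.3, -0.2]))
    print(consistency_confidence(["Paris", "paris ", "Lyon", "Paris"]))
    print(paraphrase_spread([0.91, 0.62, 0.85]))
```

Trained probes and verbalization are left out because they need, respectively, access to hidden states plus labeled training data and a prompted model call, neither of which fits a self-contained snippet.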

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators
