Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

Matéo Mahaut, Laura Aina, Paula Czarnowska, Momchil Hardalov, Thomas Müller, Lluís Màrquez
2024-06-19
Abstract: Large Language Models (LLMs) tend to be unreliable in the factuality of their answers. To address this problem, NLP researchers have proposed a range of techniques to estimate LLM's confidence over facts. However, due to the lack of a systematic comparison, it is not clear how the different methods compare to one another. To fill this gap, we present a survey and empirical comparison of estimators of factual confidence. We define an experimental framework allowing for fair comparison, covering both fact-verification and question answering. Our experiments across a series of LLMs indicate that trained hidden-state probes provide the most reliable confidence estimates, albeit at the expense of requiring access to weights and training data. We also conduct a deeper assessment of factual confidence by measuring the consistency of model behavior under meaning-preserving variations in the input. We find that the confidence of LLMs is often unstable across semantically equivalent inputs, suggesting that there is much room for improvement of the stability of models' parametric knowledge. Our code is available at https://github.com/amazon-science/factual-confidence-of-llms.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the unreliable factuality of text generated by large language models (LLMs). Specifically, it focuses on two aspects:

1. **Factual reliability**: LLMs sometimes generate incorrect information, or state facts they are uncertain about as if they were certain. This behavior can spread misleading information and erode users' trust.
2. **Comparison of factual confidence estimation methods**: Although various techniques have been proposed to estimate the factual confidence of LLMs, they have not been compared systematically, so it is unclear which methods are more reliable and robust.

### Solutions

To address these problems, the paper does the following:

1. **Defines an experimental framework**: The paper proposes an experimental framework that allows a fair comparison of factual confidence estimation methods across models and datasets. The framework covers two tasks: fact verification and question answering.
2. **Classifies and evaluates methods**: The paper groups existing factual confidence estimation methods into five categories: trained probes, sequence probability, verbalization, surrogate token probability, and consistency. It then evaluates these methods experimentally on multiple publicly available LLMs.
3. **Measures the impact of input variations**: The paper also examines how consistently LLMs behave under semantically equivalent inputs. It finds that the confidence of LLMs is often unstable across such inputs, indicating that there is still much room for improvement in the stability of the models' parametric knowledge.

### Main contributions

1. **Literature review**: The paper reviews current methods for estimating the factual confidence of LLMs.
2. **Experimental framework**: It provides an experimental framework that makes the comparison between methods fairer and more systematic.
3. **Method evaluation**: It provides experimental insights into the reliability and robustness of the different methods and offers recommendations for natural language processing (NLP) practitioners.

### Conclusions

Through systematic experiments and analysis, the paper finds that trained probes are the most reliable estimators of factual confidence, but they require access to model weights and training data. For instruction-tuned LLMs, verbalization and consistency-based methods are also viable options. In addition, the paper emphasizes the importance of testing a model's consistency across semantically equivalent inputs, to check that its encoding of factual knowledge remains stable under input variations.
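
As an illustration of two of the estimator families named above (sequence probability and consistency), and of the stability check over semantically equivalent inputs, the following minimal Python sketch may help. It is not the paper's implementation: the function names, the length normalization, the majority-agreement score, and the use of standard deviation as a stability measure are all assumptions made for illustration. The token log-probabilities and sampled answers are assumed to come from whichever LLM is being evaluated.

```python
import math
from collections import Counter
from statistics import pstdev


def sequence_confidence(token_logprobs):
    """Length-normalized sequence probability: exp(mean token log-prob).

    Assumption: token_logprobs holds the natural-log probability the model
    assigned to each token of the statement or answer.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency_confidence(sampled_answers):
    """Share of sampled answers that agree with the most frequent answer.

    Assumption: answers are compared after trimming whitespace and
    lowercasing; a real setup may need a looser notion of equivalence.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    normalized = [answer.strip().lower() for answer in sampled_answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    return top_answer, count / len(normalized)


def paraphrase_spread(confidences):
    """Population standard deviation of confidence scores obtained for
    paraphrases of the same fact; 0.0 means perfectly stable confidence."""
    if not confidences:
        raise ValueError("need at least one confidence score")
    return pstdev(confidences)
```

For example, `consistency_confidence(["Paris", "paris ", "Lyon", "Paris"])` returns `("paris", 0.75)`, because three of the four normalized samples agree. A `paraphrase_spread` close to zero over several rewordings of the same question would correspond to the stable behavior the summary describes as desirable, while a large value would indicate the kind of instability the paper reports.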