Abstract: While Large Language Models (LLMs) have demonstrated remarkable potential in natural language generation and instruction following, a persistent challenge lies in their susceptibility to "hallucinations", which erodes trust in their outputs. Although Uncertainty Quantification (UQ) presents a promising solution, implementing it accurately for LLMs remains a significant hurdle. To address this roadblock, our research starts from a fundamental heuristic insight: tokens within auto-regressive LLM-generated text do not equally reflect the underlying meaning. Some tokens carry greater relevance and representativeness than others, owing to the phenomenon of "linguistic redundancy", wherein a select few keywords suffice to convey the essence of lengthy sentences. However, existing methodologies treat all tokens with equal importance when estimating uncertainty, disregarding these inherent generative inequalities. Our analysis reveals a significant issue with state-of-the-art methods: numerous tokens (and sentences) of limited semantic significance receive equal or even excessive weighting during uncertainty estimation. To rectify this bias, we propose jointly Shifting Attention to more Relevant (SAR) components, at both the token and sentence levels, for accurate uncertainty estimation. We conduct extensive experiments on a range of popular "off-the-shelf" LLMs, including instruction-tuned LLMs such as Vicuna, WizardLM, and LLaMA-2-chat, as well as pretrained LLMs such as OPT and LLaMA, with model sizes of up to 33B parameters. We evaluate on various free-form question-answering tasks, covering domains such as reading comprehension, science Q&A, and medical Q&A. Our experimental results demonstrate the superior performance of SAR in addressing the challenges of uncertainty estimation for LLMs.
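The token-level idea in the abstract, giving semantically relevant tokens more weight than redundant ones, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how relevance is computed, so the sketch assumes that a token's relevance is the drop in sentence similarity when that token is removed, and that the uncertainty score is a relevance-weighted average of per-token negative log-probabilities. The names `toy_similarity`, `token_relevance`, and `relevance_weighted_uncertainty`, and the stop-word list, are hypothetical; a real system would likely use a learned sentence-similarity model in place of the toy one.

```python
from typing import Callable, List, Sequence

# Hypothetical stop-word list used only by the toy similarity below.
STOP_WORDS = {"the", "is", "of", "a", "an"}


def toy_similarity(a: str, b: str) -> float:
    """Jaccard overlap of content words; a stand-in for a real similarity model."""
    set_a = {w for w in a.lower().split() if w not in STOP_WORDS}
    set_b = {w for w in b.lower().split() if w not in STOP_WORDS}
    union = set_a | set_b
    if not union:
        return 1.0  # two sentences with no content words count as identical
    return len(set_a & set_b) / len(union)


def token_relevance(
    tokens: Sequence[str],
    similarity: Callable[[str, str], float] = toy_similarity,
) -> List[float]:
    """Assumed relevance: 1 - similarity(full sentence, sentence without token i)."""
    tokens = list(tokens)
    full = " ".join(tokens)
    scores = []
    for i in range(len(tokens)):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        scores.append(1.0 - similarity(full, reduced))
    return scores


def relevance_weighted_uncertainty(
    neg_log_probs: Sequence[float],
    relevance: Sequence[float],
) -> float:
    """Average per-token negative log-probabilities with normalized relevance weights."""
    if len(neg_log_probs) != len(relevance):
        raise ValueError("neg_log_probs and relevance must have the same length")
    total = sum(relevance)
    if total <= 0:
        # No token stands out: fall back to the uniform (unweighted) average.
        return sum(neg_log_probs) / len(neg_log_probs)
    return sum(r / total * u for r, u in zip(relevance, neg_log_probs))


if __name__ == "__main__":
    tokens = ["the", "capital", "is", "Paris"]
    nll = [0.1, 0.2, 0.1, 2.0]  # made-up per-token negative log-probabilities
    rel = token_relevance(tokens)
    print("relevance:", rel)
    print("uniform average:", sum(nll) / len(nll))
    print("relevance-weighted:", relevance_weighted_uncertainty(nll, rel))
```

In this toy example the function words receive zero relevance, so the uncertain content token dominates the weighted score instead of being diluted by confident but uninformative tokens, which is the bias the abstract describes in uniformly weighted estimators.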

LUQ: Long-text Uncertainty Quantification for LLMs

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models

LoGU: Long-form Generation with Uncertainty Expressions

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models

Benchmarking LLMs via Uncertainty Quantification

UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models

Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs

Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models

UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees

Multi-group Uncertainty Quantification for Long-form Text Generation

A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions

UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

Investigating Answerability of LLMs for Long-Form Question Answering

Unconditional Truthfulness: Learning Conditional Dependency for Uncertainty Quantification of Large Language Models

Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification