Abstract: Recent advances in large language models (LLMs) have considerably improved the capabilities of summarization systems. However, these systems continue to face concerns about hallucinations. While prior work has evaluated LLMs extensively in news domains, most evaluation of dialogue summarization has focused on BART-based models, leaving a gap in our understanding of the faithfulness of LLMs on this task. Our work benchmarks the faithfulness of LLMs for dialogue summarization, using human annotations and focusing on identifying and categorizing span-level inconsistencies. Specifically, we focus on two prominent LLMs: GPT-4 and Alpaca-13B. Our evaluation reveals subtleties as to what constitutes a hallucination: LLMs often generate plausible inferences that are supported by circumstantial evidence in the conversation but lack direct evidence, a pattern that is less prevalent in older models. We propose a refined taxonomy of errors, coining the category of "Circumstantial Inference" to bucket these LLM behaviors, and we release the dataset. Using our taxonomy, we compare the behavioral differences between LLMs and older fine-tuned models. Additionally, we systematically assess the efficacy of automatic error detection methods on LLM summaries and find that they struggle to detect these nuanced errors. To address this, we introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics, particularly for identifying "Circumstantial Inference."

Why LLMs Hallucinate, and How to Get (Evidential) Closure: Perceptual, Intensional, and Extensional Learning for Faithful Natural Language Generation

Do LLMs Know about Hallucination? An Empirical Investigation of LLM's Hidden States

LLMs Will Always Hallucinate, and We Need to Live With This

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Look Within, Why LLMs Hallucinate: A Causal Perspective

Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation

Sources of Hallucination by Large Language Models on Inference Tasks

Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models

Redefining "Hallucination" in LLMs: Towards a psychology-informed framework for mitigating misinformation

Comprehending and Reducing LLM Hallucinations

Banishing LLM Hallucinations Requires Rethinking Generalization

A Debate-Driven Experiment on LLM Hallucinations and Accuracy

Learning to Trust Your Feelings: Leveraging Self-awareness in LLMs for Hallucination Mitigation

Misinforming LLMs: vulnerabilities, challenges and opportunities

LLMs' Understanding of Natural Language Revealed

Unravelling the Mysteries of Hallucination in Large Language Models: Strategies for Precision in Artificial Intelligence Language Generation

LLM Internal States Reveal Hallucination Risk Faced With a Query

Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

Hallucination is Inevitable: An Innate Limitation of Large Language Models