Abstract: Large language models (LLMs) have demonstrated impressive proficiency in information retrieval, but they are prone to generating incorrect responses that conflict with reality, a phenomenon known as intrinsic hallucination. The critical challenge lies in the unclear and unreliable distribution of facts within LLMs trained on vast amounts of data. The prevalent approach frames factual detection as a question-answering task, in which LLMs are asked about factual knowledge and their answers are checked for correctness. However, existing studies have primarily derived test cases from only a few specific domains, such as movies and sports, which limits both the comprehensive observation of missing knowledge and the analysis of unexpected hallucinations. To address this issue, we propose OntoFact, an adaptive framework for detecting facts unknown to LLMs that mines the ontology-level skeleton of the missing knowledge. Specifically, we argue that LLMs could expose the ontology-based similarity among missing facts, and we introduce five representative knowledge graphs (KGs) as benchmarks. We further devise an ontology-driven reinforcement learning (ORL) mechanism that automatically produces error-prone test cases with specific entities and relations. The ORL mechanism rewards navigation of the KGs in directions that are likely to unveil factual errors. Moreover, our empirical results show that dominant LLMs are biased towards answering Yes rather than No, regardless of whether the knowledge in question is included. To mitigate this overconfidence, we leverage a hallucination-free detection (HFD) strategy that tackles unfair comparisons between baselines, thereby improving the robustness of the results. Experimental results on 5 datasets, using 32 representative LLMs, reveal a general lack of factual knowledge in current LLMs. Notably, ChatGPT exhibits fact error rates of 51.6% on DBpedia and 64.7% on YAGO. Additionally, the ORL mechanism demonstrates promising error prediction scores, with F1 scores ranging from 70% to 90% across most LLMs. Compared with exhaustive testing, ORL achieves an average recall of 80% while reducing evaluation time by 35.29% to 63.12%.
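To make the question-answering setup and the Yes bias concrete, here is a minimal Python sketch. It is not the OntoFact implementation and not the HFD strategy, whose details the abstract does not give. It assumes a simple paired probe: each true KG triple is asked alongside a corrupted copy with a wrong tail entity, and a fact counts as known only if the model answers Yes to the true triple and No to the corrupted one. The question template, the function names (`to_question`, `says_yes`, `paired_probe`), and the `ask` callable standing in for an LLM are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation phrase, tail entity)


def to_question(triple: Triple) -> str:
    """Render a KG triple as a Yes/No question (hypothetical template)."""
    head, relation, tail = triple
    return f"Is it true that {head} {relation} {tail}? Answer Yes or No."


def says_yes(answer: str) -> bool:
    """Treat any answer that starts with 'yes' (case-insensitive) as Yes."""
    return answer.strip().lower().startswith("yes")


def paired_probe(
    facts: List[Triple],
    wrong_tails: List[str],
    ask: Callable[[str], str],
) -> Dict[str, float]:
    """Ask each true triple and a corrupted copy; report Yes rate and error rate.

    Assumption (not from the paper): a fact is 'known' only when the model
    says Yes to the true triple AND No to the corrupted one, so a model
    that always answers Yes gets no credit.
    """
    if not facts or len(facts) != len(wrong_tails):
        raise ValueError("need one wrong tail per fact, and at least one fact")

    yes_count = 0
    known = 0
    for (head, relation, tail), wrong in zip(facts, wrong_tails):
        yes_true = says_yes(ask(to_question((head, relation, tail))))
        yes_false = says_yes(ask(to_question((head, relation, wrong))))
        yes_count += int(yes_true) + int(yes_false)
        if yes_true and not yes_false:
            known += 1

    n = len(facts)
    return {
        "yes_rate": yes_count / (2 * n),  # 0.5 is unbiased for paired probes
        "error_rate": 1 - known / n,
    }
```

In this sketch, a `yes_rate` well above 0.5 would indicate a bias towards Yes, because exactly half of the paired questions have Yes as the correct answer. In practice `ask` would wrap a call to the model under test.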

OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs Via Ontology-Driven Reinforcement Learning
