Abstract: Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation. However, they often lack factuality, producing a mixture of true and false information, especially in long-form generation. In this work, we investigate the factuality of long-form text generation across various LLMs, including GPT-4, Gemini-1.5-Pro, Claude-3-Opus, Llama-3-70B, and Mistral. Our analysis reveals that factuality scores tend to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims. Furthermore, we explore two evaluation settings that assess whether LLMs can accurately judge the correctness of their own outputs: Self-Known (the percentage of supported atomic claims, decomposed from LLM outputs, that the corresponding LLMs judge as correct) and Self-Unknown (the percentage of unsupported atomic claims that the corresponding LLMs judge as incorrect). The results indicate that even advanced models like GPT-4 and Gemini-1.5-Pro fail to achieve perfect Self-Known scores, while their Self-Unknown scores remain notably above zero, reflecting ongoing uncertainty in their self-assessments. Moreover, we find a correlation between higher Self-Known scores and improved factuality, while higher Self-Unknown scores are associated with lower factuality. Interestingly, even without significant changes in the models' self-judgment (Self-Known and Self-Unknown), the number of unsupported claims can increase, likely as an artifact of long-form generation. These findings show the limitations of current LLMs in long-form generation and provide valuable insights for improving factuality in long-form text generation.
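
The two self-judgment metrics defined in the abstract can be illustrated with a minimal sketch. It assumes each atomic claim already carries two labels: whether an external checker found it supported, and whether the generating model judged it correct. The `Claim` class, its field names, and the sample data are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    """One atomic claim decomposed from a model output (hypothetical schema)."""
    text: str
    supported: bool       # assumed label from an external fact-checking step
    judged_correct: bool  # the generating model's own verdict on the claim


def self_known(claims: List[Claim]) -> Optional[float]:
    """Percentage of supported claims that the model judges as correct."""
    supported = [c for c in claims if c.supported]
    if not supported:
        return None  # undefined when there are no supported claims
    return 100.0 * sum(c.judged_correct for c in supported) / len(supported)


def self_unknown(claims: List[Claim]) -> Optional[float]:
    """Percentage of unsupported claims that the model judges as incorrect."""
    unsupported = [c for c in claims if not c.supported]
    if not unsupported:
        return None  # undefined when there are no unsupported claims
    return 100.0 * sum(not c.judged_correct for c in unsupported) / len(unsupported)


# Illustrative data only: four supported claims and two unsupported ones.
claims = [
    Claim("claim A", supported=True, judged_correct=True),
    Claim("claim B", supported=True, judged_correct=True),
    Claim("claim C", supported=True, judged_correct=True),
    Claim("claim D", supported=True, judged_correct=False),
    Claim("claim E", supported=False, judged_correct=False),
    Claim("claim F", supported=False, judged_correct=True),
]

print(self_known(claims))    # 75.0
print(self_unknown(claims))  # 50.0
```

In this toy example the model recognizes three of its four supported claims as correct (Self-Known of 75%) and flags one of its two unsupported claims as incorrect (Self-Unknown of 50%). Returning `None` for an empty denominator is a design choice of this sketch, not something the abstract specifies.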

Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown
