Abstract: Large language models have demonstrated significant potential as the next-generation information access engines. However, their reliability is hindered by issues of hallucination and generating non-factual content. This is particularly problematic in long-form responses, where assessing and ensuring factual accuracy is complex. In this paper, we address this gap by proposing FactAlign, a novel alignment framework designed to enhance the factuality of LLMs' long-form responses while maintaining their helpfulness. We introduce fKTO, a fine-grained, sentence-level alignment algorithm that extends the Kahneman-Tversky Optimization (KTO) alignment method. Leveraging recent advances in automatic factuality evaluation, FactAlign utilizes fine-grained factuality assessments to guide the alignment process. Our experiments on open-domain prompts and information-seeking questions demonstrate that FactAlign significantly improves the factual accuracy of LLM responses while also improving their helpfulness. Further analyses identify that FactAlign is capable of training LLMs to provide more information without losing factual precision, thus improving the factual F1 score. Our source code, datasets, and trained models are publicly available at https://github.com/MiuLab/FactAlign

What problem does this paper attempt to address?

This paper addresses hallucination and non-factual content in the long-form answers generated by large language models (LLMs). Specifically, it proposes a framework named FactAlign that aims to improve the factual accuracy of long-form answers while maintaining their helpfulness.

### Main Problems

1. **Hallucination and Non-Factual Content**: LLMs are prone to hallucination (i.e., generating incorrect or non-existent information) when producing long-form answers, which undermines their reliability and credibility in practical applications.
2. **Evaluation Complexity**: Evaluating and ensuring the factual accuracy of a long-form answer is difficult because the answer contains multiple sub-claims, each of which must be verified separately.

### Solutions

The paper proposes the following solutions:

1. **FactAlign Framework**:
   - **fKTO Algorithm**: A fine-grained, sentence-level alignment algorithm that extends the Kahneman-Tversky Optimization (KTO) alignment method to use the fine-grained signals provided by automatic factuality evaluators.
   - **Automatic Long-Form Fact Evaluator**: Automated tools perform fine-grained factuality evaluation of long-form answers, reducing the cost of manual annotation.
2. **Iterative Optimization**:
   - The trained model is periodically used to generate new answers, which are then re-evaluated for factual accuracy. This reduces distribution shift and improves the alignment effect.
3. **Multi-stage Alignment**:
   - **Response-level Alignment**: The entire answer is aligned with the standard KTO loss function.
   - **Sentence-level Alignment**: Each sentence is aligned with the fKTO loss function, giving the model a more targeted training signal.
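
The response-level and sentence-level alignment described above can be illustrated with a minimal Python sketch. It assumes the commonly cited form of the KTO loss (a sigmoid over the policy-to-reference log-ratio, shifted by a reference point). The function names, the `sentence_weight` parameter, and the averaging over sentences are hypothetical choices made for illustration; this summary does not give the exact fKTO objective, so the sketch should not be read as the paper's implementation.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(log_ratio: float, desirable: bool, kl_ref: float = 0.0,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> float:
    """KTO-style loss for a single sample.

    log_ratio is log pi_theta(y|x) - log pi_ref(y|x); kl_ref is the
    reference point (a KL estimate in KTO, fixed to 0 here by assumption).
    """
    if desirable:
        # Reward raising the likelihood of desirable outputs.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_ref)))
    # Reward lowering the likelihood of undesirable outputs.
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - log_ratio)))


def combined_loss(response_log_ratio: float, response_is_good: bool,
                  sentence_log_ratios: Sequence[float],
                  sentence_is_factual: Sequence[bool],
                  sentence_weight: float = 1.0) -> float:
    """Response-level term plus a weighted sentence-level term.

    ASSUMPTION: averaging the per-sentence losses and adding them with a
    scalar weight is an illustrative choice, not the paper's stated formula.
    """
    response_term = kto_loss(response_log_ratio, response_is_good)
    sentence_terms = [
        kto_loss(ratio, factual)
        for ratio, factual in zip(sentence_log_ratios, sentence_is_factual)
    ]
    sentence_term = (sum(sentence_terms) / len(sentence_terms)
                     if sentence_terms else 0.0)
    return response_term + sentence_weight * sentence_term
```

In this sketch, each sentence labeled non-factual by the evaluator contributes its own "undesirable" term, so a response with one bad sentence is penalized locally rather than rejected as a whole.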

### Experimental Results

The experimental results show that the FactAlign framework significantly improves both the factual accuracy and the helpfulness of long-form answers generated by LLMs. Specifically:

- The F1@100 metric improves by 40.1% in relative terms.
- The average MT-Bench score increases by 29.2%.
- The aligned model outperforms larger models such as GPT-3.5-Turbo and LLaMA-2-70B-Chat on the F1@100 and FactScore metrics.

These results indicate that, with fine-grained alignment, smaller LMs can surpass larger general-domain LMs in factual accuracy.

### Summary

By proposing the FactAlign framework, this paper addresses the factual accuracy and reliability of LLMs when generating long-form answers, offering an effective method for improving the practical value of LLMs.
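
The F1@100 metric cited in the experimental results is not defined in this summary. The following Python sketch assumes the definition commonly used in long-form factuality evaluation: precision is the share of supported claims in a response, recall is the number of supported claims relative to a target count K (capped at 1), and F1@K is their harmonic mean. The function name and arguments are hypothetical.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """F1@K for one response, under the assumed definition.

    supported / not_supported are counts of claims judged factual or
    non-factual by an evaluator; k is the target number of supported claims.
    """
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# Two responses with the same precision (0.8) but different amounts of
# supported information: the longer one scores higher at K=100.
short_answer = f1_at_k(supported=40, not_supported=10, k=100)
long_answer = f1_at_k(supported=80, not_supported=20, k=100)
```

Under this definition, a model that provides more supported claims at the same precision raises its recall and therefore its F1@K, which matches the abstract's remark about providing more information without losing factual precision.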

FactAlign: Long-form Factuality Alignment of Large Language Models

Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity

Long-form factuality in large language models

FLAME: Factuality-Aware Alignment for Large Language Models

Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation

Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models

OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs Via Ontology-Driven Reinforcement Learning

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Is Factuality Enhancement a Free Lunch For LLMs? Better Factuality Can Lead to Worse Context-Faithfulness

FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees

Factuality of Large Language Models: A Survey

UFO: a Unified and Flexible Framework for Evaluating Factuality of Large Language Models

Language Models Hallucinate, but May Excel at Fact Verification

FELM: Benchmarking Factuality Evaluation of Large Language Models

Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs

Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations

Text Alignment Is An Efficient Unified Model for Massive NLP Tasks

Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown

Improving Factuality in Large Language Models via Decoding-Time Hallucinatory and Truthful Comparators

LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong