Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution

Xinze Li,Yixin Cao,Liangming Pan,Yubo Ma,Aixin Sun

2024-05-23

Abstract: Although achieving great success, Large Language Models (LLMs) usually suffer from unreliable hallucinations. Although language attribution can be a potential solution, there are no suitable benchmarks and evaluation metrics to attribute LLMs to structured knowledge. In this paper, we define a new task of Knowledge-aware Language Model Attribution (KaLMA) that improves upon three core concerns with conventional attributed LMs. First, we extend the attribution source from unstructured texts to a Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios. Second, we propose a new "Conscious Incompetence" setting considering the incomplete knowledge repository, where the model identifies the need for supporting knowledge beyond the provided KG. Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text-citation alignment. To implement the above innovations, we build a dataset in the biography domain, BioKaLMA, via an evolutionary question generation strategy, to control the question complexity and the knowledge necessary for the answer. For evaluation, we develop a baseline solution and demonstrate the room for improvement in LLMs' citation generation, emphasizing the importance of incorporating the "Conscious Incompetence" setting, and the critical role of retrieval accuracy.

Computation and Language

What problem does this paper attempt to address?

The paper aims to address hallucination in large language models (LLMs): generated answers may contain factual errors, which makes them unreliable. Specifically, the paper proposes a new task, Knowledge-aware Language Model Attribution (KaLMA), that improves on three core issues in existing attribution methods:

1. **Expanding Attribution Sources**: Extending attribution sources from unstructured text to knowledge graphs (KGs), whose rich structure benefits both attribution performance and application scenarios.
2. **Introducing the "Conscious Incompetence" Setting**: Because knowledge bases are incomplete, the model should recognize when the required supporting knowledge lies beyond the provided knowledge graph.
3. **Proposing Comprehensive Evaluation Metrics**: Automatic evaluation metrics covering text quality, citation quality, and text-citation alignment, without the need for manually annotated reference answers.

With these improvements, the paper constructs BioKaLMA, a dataset in the biography domain, and develops a baseline solution, demonstrating that there is still room for improvement in citation generation by LLMs. It emphasizes the importance of the "Conscious Incompetence" setting and the critical role of retrieval accuracy.
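The summary above names citation quality as one evaluation axis but does not give its formula. As a rough illustration only, the sketch below assumes citations are scored as a set of KG triples against a gold set, using ordinary precision, recall, and F1. The function name `citation_scores`, the triple layout, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: one KG triple as (subject, relation, object).
Triple = Tuple[str, str, str]


def citation_scores(cited: Set[Triple], gold: Set[Triple]) -> Dict[str, float]:
    """Set-based precision/recall/F1 between cited and gold triples.

    This is an assumed, generic formulation, not the paper's exact metric.
    """
    correct = cited & gold  # citations that also appear in the gold set
    precision = len(correct) / len(cited) if cited else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data (invented for illustration): three gold triples, two citations,
# one of which matches the gold set.
gold = {
    ("person_a", "born_in", "city_x"),
    ("person_a", "occupation", "chemist"),
    ("person_a", "educated_at", "university_y"),
}
cited = {
    ("person_a", "born_in", "city_x"),
    ("person_a", "occupation", "painter"),
}

scores = citation_scores(cited, gold)
print(scores)
```

Under the same assumption, the "Conscious Incompetence" setting could be scored the same way, by comparing the set of knowledge the model flags as missing from the KG against the set that is actually absent.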

Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs

Automatic Evaluation of Attribution by Large Language Models

Statistical Knowledge Assessment for Large Language Models

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

A Survey of Large Language Models Attribution

Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators

Assessing the Reliability of Large Language Model Knowledge

Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study over Open-ended Question Answering

Benchmarking Biomedical Relation Knowledge in Large Language Models

Enhancing Large Language Models with Knowledge Graphs for Robust Question Answering

Enhancing Answer Attribution for Faithful Text Generation with Large Language Models

How Reliable are LLMs as Knowledge Bases? Re-thinking Facutality and Consistency

Knowledge-Augmented Language Model Verification

ALCUNA: Large Language Models Meet New Knowledge

Attribute or Abstain: Large Language Models as Long Document Assistants

Systematic Assessment of Factual Knowledge in Large Language Models

Enhancing Large Language Models with Pseudo- and Multisource- Knowledge Graphs for Open-ended Question Answering

Advancing Large Language Model Attribution through Self-Improving

Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory