Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution

Xinze Li, Yixin Cao, Liangming Pan, Yubo Ma, Aixin Sun
2024-05-23
Abstract: Although achieving great success, Large Language Models (LLMs) usually suffer from unreliable hallucinations. Although language attribution can be a potential solution, there are no suitable benchmarks and evaluation metrics to attribute LLMs to structured knowledge. In this paper, we define a new task of Knowledge-aware Language Model Attribution (KaLMA) that improves upon three core concerns with conventional attributed LMs. First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios. Second, we propose a new "Conscious Incompetence" setting considering the incomplete knowledge repository, where the model identifies the need for supporting knowledge beyond the provided KG. Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment. To implement the above innovations, we build a dataset in biography domain BioKaLMA via evolutionary question generation strategy, to control the question complexity and necessary knowledge to the answer. For evaluation, we develop a baseline solution and demonstrate the room for improvement in LLMs' citation generation, emphasizing the importance of incorporating the "Conscious Incompetence" setting, and the critical role of retrieval accuracy.
Computation and Language
What problem does this paper attempt to address?
The paper addresses hallucination in large language models (LLMs): generated answers may contain factual errors, which makes them unreliable. It proposes a new task, Knowledge-aware Language Model Attribution (KaLMA), that targets three core concerns with existing attribution methods:

1. **Expanding Attribution Sources**: Attribution sources are extended from unstructured text to knowledge graphs (KGs), whose rich structure benefits both attribution performance and the range of application scenarios.
2. **Introducing the "Conscious Incompetence" Setting**: Because the knowledge repository is incomplete, the model is expected to recognize when the supporting knowledge it needs lies beyond the provided KG.
3. **Proposing Comprehensive Evaluation Metrics**: An automatic evaluation metric covers text quality, citation quality, and text-citation alignment, without the need for manually annotated reference answers.

To implement these ideas, the paper builds BioKaLMA, a dataset in the biography domain, and develops a baseline solution. The evaluation shows that LLMs still have room for improvement in citation generation, and it emphasizes the importance of the "Conscious Incompetence" setting and the critical role of retrieval accuracy.
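
To make the citation-quality idea and the "Conscious Incompetence" setting more concrete, here is a minimal sketch under stated assumptions. It is not the paper's metric implementation: the `Sentence` data model, the `NA` marker for knowledge missing from the KG, the example triples, and both function names are hypothetical, chosen only for illustration. The sketch treats each citation as a KG triple and scores the cited triples against a gold set using plain set-based precision and recall.

```python
from dataclasses import dataclass, field

# Hypothetical marker a model emits when a sentence needs knowledge
# that the provided KG does not contain ("Conscious Incompetence").
NA = "NA"


@dataclass
class Sentence:
    """One generated sentence with the KG triples (or NA) it cites."""
    text: str
    citations: set = field(default_factory=set)


def citation_precision_recall(sentences, gold_triples):
    """Set-based precision/recall of cited triples against a gold set.

    Assumption: a citation counts as correct only if it exactly matches
    a gold triple. NA markers are excluded from this score.
    """
    cited = set()
    for s in sentences:
        cited |= {c for c in s.citations if c != NA}
    correct = cited & set(gold_triples)
    precision = len(correct) / len(cited) if cited else 0.0
    recall = len(correct) / len(gold_triples) if gold_triples else 0.0
    return precision, recall


def sentences_flagging_missing_knowledge(sentences):
    """Return the text of sentences where the model signalled NA."""
    return [s.text for s in sentences if NA in s.citations]


# Illustrative (made-up) triples in (subject, relation, object) form.
t1 = ("person_A", "place_of_birth", "city_X")
t2 = ("person_A", "occupation", "painter")
t3 = ("person_A", "spouse", "person_B")

answer = [
    Sentence("Person A was born in City X.", {t1}),
    Sentence("Person A married Person B and later moved abroad.", {t3, NA}),
]
gold = {t1, t2}

precision, recall = citation_precision_recall(answer, gold)
flagged = sentences_flagging_missing_knowledge(answer)
```

In this toy case one of the two cited triples is in the gold set and one of the two gold triples is cited, so both scores come out to 0.5. The second sentence is flagged as needing knowledge beyond the KG. A real evaluation would also need to check that each sentence's text is actually supported by its citations, which this sketch does not attempt.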