Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen
2024-10-21
Abstract: Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine models' behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. The WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs' performance in RAG settings.
Computation and Language, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The paper addresses how large language models (LLMs) handle and present information when the retrieved context contains knowledge conflicts. Specifically, it asks whether LLMs transparently inform users of a conflict when multiple entities with the same name have different attributes, rather than autonomously deciding which information to present based on their own biases.

To investigate this, the authors constructed a public benchmark dataset called WhoQA. WhoQA induces knowledge conflicts by posing questions about an attribute shared by entities with the same name, so that a single question can have 2 to 8 different answers. The dataset includes 5,152 questions covering 13 types of Wikidata properties and 150,000 Wikipedia entities.

The main contributions of the paper include:

1. **Highlighting the prevalence of knowledge conflicts in real-world scenarios** and manually constructing an evaluation set containing 5,152 questions and their supporting evidence as a gold standard benchmark.
2. **Conducting extensive experiments** with powerful LLMs, with results showing that knowledge conflicts pose significant challenges to LLMs' performance, potentially leading to misleading or biased outcomes.
3. **Publicly releasing the WhoQA dataset** to facilitate future research on knowledge conflicts in LLMs.

Through this work, the authors hope to promote greater transparency in how LLMs handle knowledge conflicts, thereby improving the quality of user decision-making.
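The conflict-induction idea described above (asking about one property shared by several same-name entities) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the toy records, the `build_conflict_questions` helper, and the question template are all hypothetical, and only the 2-to-8 answer range is taken from the description above.

```python
from collections import defaultdict

# Hypothetical toy records: (entity_id, name, property, value).
# These are illustrative only and are not taken from WhoQA.
TOY_ENTITIES = [
    ("Q1", "John Smith", "occupation", "explorer"),
    ("Q2", "John Smith", "occupation", "economist"),
    ("Q3", "John Smith", "occupation", "explorer"),
    ("Q4", "Jane Doe", "occupation", "painter"),
]


def build_conflict_questions(records, min_answers=2, max_answers=8):
    """Keep (name, property) pairs whose same-name entities disagree.

    A pair qualifies when it has between min_answers and max_answers
    distinct values; each value keeps the entity ids that support it.
    """
    by_pair = defaultdict(dict)  # (name, prop) -> {value: [entity_ids]}
    for entity_id, name, prop, value in records:
        by_pair[(name, prop)].setdefault(value, []).append(entity_id)

    questions = []
    for (name, prop), by_value in by_pair.items():
        if min_answers <= len(by_value) <= max_answers:
            questions.append({
                # Assumed question template, for illustration only.
                "question": f"What is the {prop} of {name}?",
                "answers": sorted(by_value),
                "evidence": dict(by_value),
            })
    return questions


questions = build_conflict_questions(TOY_ENTITIES)
for q in questions:
    print(q["question"], q["answers"])
```

In this toy input only "John Smith" yields a question, because "Jane Doe" has a single answer and therefore no conflict. A transparent model would be expected to report every answer in `answers` rather than silently picking one.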