Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen

2024-10-21

Abstract: Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset for examining models' behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinct answers. The WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that, despite the simplicity of WhoQA questions, knowledge conflicts significantly degrade LLMs' performance in RAG settings.

Computation and Language, Artificial Intelligence, Information Retrieval

What problem does this paper attempt to address?

The paper addresses how large language models (LLMs) handle and present information when the retrieval context contains knowledge conflicts. Specifically, it examines whether LLMs transparently inform users of a conflict when multiple entities with the same name have different attributes, rather than autonomously deciding which information to present based on their own biases.

To investigate this, the authors constructed WhoQA, a public benchmark dataset for evaluating LLMs in knowledge conflict situations. WhoQA induces conflicts by asking about an attribute shared by entities with the same name, so each question can have 2 to 8 different answers. The dataset includes 5,152 questions covering 13 types of Wikidata properties and 150,000 Wikipedia entities.

The main contributions of the paper are:

1. **Highlighting the prevalence of knowledge conflicts in real-world scenarios** and manually constructing an evaluation set of 5,152 questions with their supporting evidence as a gold-standard benchmark.
2. **Conducting extensive experiments** with powerful LLMs, with results showing that knowledge conflicts pose significant challenges to LLMs' performance and can lead to misleading or biased outcomes.
3. **Publicly releasing the WhoQA dataset** to facilitate future research on knowledge conflicts in LLMs.

Through this work, the authors hope to promote greater transparency in how LLMs handle knowledge conflicts, thereby improving the quality of user decision-making.
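The conflict-induction idea described above (asking about one property shared by several entities with the same name) can be illustrated with a minimal Python sketch. This is not the authors' construction pipeline: the record layout, the `build_conflict_questions` helper, the question template, and all entity names and values are hypothetical placeholders chosen for illustration. Only the 2-to-8 answer range is taken from the summary.

```python
from collections import defaultdict

# Hypothetical toy records: (entity_id, shared_name, property, value).
# Placeholder data for illustration only; not taken from WhoQA or Wikidata.
RECORDS = [
    ("E1", "Alex Example", "occupation", "painter"),
    ("E2", "Alex Example", "occupation", "physicist"),
    ("E3", "Alex Example", "occupation", "painter"),
    ("E4", "Sam Sample", "occupation", "chef"),
    ("E1", "Alex Example", "place of birth", "Town A"),
    ("E2", "Alex Example", "place of birth", "Town B"),
]


def build_conflict_questions(records, min_answers=2, max_answers=8):
    """Group same-name entities by property and keep groups with conflicting values.

    The 2..8 bounds mirror the answer range mentioned in the summary; the
    question template below is an assumption, not the dataset's wording.
    """
    # (name, property) -> {value: [entity ids that support this value]}
    answers = defaultdict(dict)
    for entity_id, name, prop, value in records:
        answers[(name, prop)].setdefault(value, []).append(entity_id)

    questions = []
    for (name, prop), by_value in answers.items():
        # A conflict exists only when same-name entities disagree on the value.
        if min_answers <= len(by_value) <= max_answers:
            questions.append({
                "question": f"What is the {prop} of {name}?",
                "answers": sorted(by_value),
                "evidence": {v: sorted(ids) for v, ids in by_value.items()},
            })
    return questions


if __name__ == "__main__":
    for q in build_conflict_questions(RECORDS):
        print(q["question"], "->", q["answers"])
```

In this sketch, a name shared by entities that all agree on a value (or a name with a single entity) yields no question, since there is nothing for a model to disclose as conflicting. Keeping the supporting entity identifiers per answer loosely mirrors the idea of pairing each answer with its evidence.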

Resolving Knowledge Conflicts in Large Language Models

Knowledge Conflicts for LLMs: A Survey

DYNAMICQA: Tracing Internal Knowledge Conflicts in Language Models

Analysing the Residual Stream of Language Models Under Knowledge Conflicts

Tug-of-War Between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

Studying Large Language Model Behaviors Under Context-Memory Conflicts With Real Documents

Open Domain Question Answering with Conflicting Contexts

Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment

QA-RAG: Exploring LLM Reliance on External Knowledge

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models

Statistical Knowledge Assessment for Large Language Models

Untangle the KNOT: Interweaving Conflicting Knowledge and Reasoning Skills in Large Language Models

Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation

Enhancing Large Language Models with Knowledge Graphs for Robust Question Answering

Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models