Abstract: Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.

What problem does this paper attempt to address?

This paper examines how large language models (LLMs) handle knowledge conflicts between passages from the same trusted source, such as Wikipedia. Although retrieval-augmented generation (RAG) has been used to mitigate incorrect and outdated information in LLMs, it remains unclear how these models handle conflicts between different retrieved passages, especially when the passages come from the same source and are equally trustworthy. To address this, the paper introduces WikiContradict, a benchmark of 253 human-annotated instances for evaluating LLMs augmented with retrieved passages that contain real-world knowledge conflicts. The authors benchmark a range of open-source and closed LLMs under several question answering (QA) scenarios, including RAG with a single passage and RAG with two contradictory passages. Through human evaluation, they find that all models struggle to generate answers that accurately reflect the conflicting context, especially for implicit conflicts that require reasoning. They also develop an automated model for estimating LLM performance, which achieves an F-score of 0.8 on the human evaluation dataset. The paper highlights the difficulty current LLMs have with real-world conflicting information and presents WikiContradict as a resource for the research community to examine and track how well LLMs handle knowledge conflicts in complex real-life scenarios.
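
To make the reported F-score concrete, the following is a minimal sketch of how agreement between an automated judge and human annotators could be scored. It assumes a binary F1 (the text above does not say which F-score variant the authors use), and the function name `f1_score` and the label lists are hypothetical, made-up illustrations rather than data or code from the paper.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Binary F1 of judge labels measured against human labels.

    True means "this answer was judged correct". Treating the human
    labels as ground truth is an assumption made for this sketch.
    """
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    # Count true positives, false positives, and false negatives.
    tp = sum(h and j for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))
    if tp == 0:
        return 0.0  # avoids division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six answers (not taken from WikiContradict).
human_labels = [True, True, False, True, False, False]
judge_labels = [True, False, False, True, True, False]
score = f1_score(human_labels, judge_labels)
print(f"F1 = {score:.2f}")
```

In this toy case the judge agrees with the humans on two of three positive answers and raises one false alarm, so precision and recall are both 2/3. An F-score of 0.8, as reported, would indicate closer but still imperfect agreement with human judgement.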

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms

ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence

Who's Who: Large Language Models Meet Knowledge Conflicts in Practice

Knowledge Conflicts for LLMs: A Survey

Resolving Knowledge Conflicts in Large Language Models

WikiContradiction: Detecting Self-Contradiction Articles on Wikipedia

Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

Untangle the KNOT: Interweaving Conflicting Knowledge and Reasoning Skills in Large Language Models

BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval-Augmented Generation

Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration

Tug-of-War Between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models

ECon: On the Detection and Resolution of Evidence Conflicts

LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

Adversarial Databases Improve Success in Retrieval-based Large Language Models

Understanding and Mitigating Language Confusion in LLMs

LLM Robustness Against Misinformation in Biomedical Question Answering

Studying Large Language Model Behaviors Under Context-Memory Conflicts With Real Documents