WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Yufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna Sattigeri
2024-06-20
Abstract: Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: https://ibm.biz/wikicontradict.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper examines how large language models (LLMs) handle knowledge conflicts that arise within a single trusted source such as Wikipedia. Although retrieval-augmented generation (RAG) is used to mitigate hallucinations and outdated information in LLMs, it remains unclear how these models behave when different retrieved passages contradict each other, especially when the passages come from the same source and are equally trustworthy. To study this, the paper introduces WikiContradict, a benchmark of 253 human-annotated instances for evaluating LLMs that are augmented with retrieved passages containing real-world knowledge conflicts. It benchmarks a range of open-source and closed LLMs under several question answering (QA) scenarios, including RAG with a single passage and RAG with two contradictory passages. Human evaluation on a subset of the instances, covering 5 LLMs and over 3,500 judgements, shows that all models struggle to generate answers that accurately reflect the conflict in the context, especially for implicit conflicts that require reasoning. Because human evaluation is costly, the authors also introduce an automated evaluation model built on a strong open-source language model, which achieves an F-score of 0.8, and use it to evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. The benchmark is released so that the research community can examine and track how well LLMs handle conflicting information in real-world settings.
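To make the reported F-score concrete, the following is a minimal sketch of how agreement between an automated judge and human annotators could be scored. It assumes a binary setup in which each answer is labelled correct or not and "correct" is the positive class; the paper's exact label scheme and averaging method are not given here, and the function name `f_score` and the sample labels are hypothetical.

```python
from typing import Sequence


def f_score(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Binary F1 of the judge's labels, treating human labels as ground truth.

    Assumption: True means "answer judged correct". This is an illustrative
    setup, not the paper's documented evaluation protocol.
    """
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)          # both say correct
    fp = sum((not h) and j for h, j in pairs)    # judge says correct, human disagrees
    fn = sum(h and (not j) for h, j in pairs)    # human says correct, judge disagrees

    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for five answers, for illustration only.
human_labels = [True, True, False, True, False]
judge_labels = [True, False, False, True, True]
print(round(f_score(human_labels, judge_labels), 3))  # 0.667
```

In this toy example the judge has two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F-score is 2/3.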