Abstract: Retrieval-augmented generation (RAG) has emerged as a promising way to mitigate limitations of large language models (LLMs) such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different retrieved passages, especially when these passages originate from the same source and are equally trustworthy. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions whose answers vary depending on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage and RAG with two contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving five LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict at https://ibm.biz/wikicontradict.

WikiContradiction: Detecting Self-Contradiction Articles on Wikipedia

Computing Semantic Relatedness Using Structured Information of Wikipedia

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Contradiction Detection with Contradiction-Specific Word Embedding

SparseCL: Sparse Contrastive Learning for Contradiction Retrieval

Red Teaming Language Models for Processing Contradictory Dialogues

A Linguistic Investigation of Machine Learning based Contradiction Detection Models: An Empirical Analysis and Future Perspectives

CDConv: A Benchmark for Contradiction Detection in Chinese Conversations

Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions

Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation

Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

Towards Semantic Modeling of Contradictions and Disagreements: A Case Study of Medical Guidelines

Generating Prototypes for Contradiction Detection Using Large Language Models and Linguistic Rules

A Large Collection of Model-generated Contradictory Responses for Consistency-aware Dialogue Systems

Error Link Detection and Correction in Wikipedia

I like fish, especially dolphins: Addressing Contradictions in Dialogue Modeling

ClaimDiff: Comparing and Contrasting Claims on Contentious Issues

Longitudinal Assessment of Reference Quality on Wikipedia

Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset

Language-Agnostic Modeling of Source Reliability on Wikipedia

Contrastive sentence representation learning with adaptive false negative cancellation