Making Retrieval-Augmented Language Models Robust to Irrelevant Context

Ori Yoran, Tomer Wolfson, Ori Ram, Jonathan Berant
2024-05-05
Abstract: Retrieval-augmented language models (RALMs) hold promise to produce language understanding systems that are factual, efficient, and up-to-date. An important desideratum of RALMs is that retrieved information helps model performance when it is relevant, and does not harm performance when it is not. This is particularly important in multi-hop reasoning scenarios, where misuse of irrelevant evidence can lead to cascading errors. However, recent work has shown that retrieval augmentation can sometimes have a negative effect on performance. In this work, we present a thorough analysis on five open-domain question answering benchmarks, characterizing cases when retrieval reduces accuracy. We then propose two methods to mitigate this issue. First, a simple baseline that filters out retrieved passages that do not entail question-answer pairs according to a natural language inference (NLI) model. This is effective in preventing performance reduction, but at a cost of also discarding relevant passages. Thus, we propose a method for automatically generating data to fine-tune the language model to properly leverage retrieved passages, using a mix of relevant and irrelevant contexts at training time. We empirically show that even 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the robustness of Retrieval-Augmented Language Models (RALMs) to irrelevant retrieved information. Specifically, the authors want retrieval to improve the model's performance when the retrieved passages are relevant, and not to degrade it when they are irrelevant. This matters most in multi-hop reasoning, where misusing irrelevant evidence can cause cascading errors.

The paper analyzes five open-domain question-answering benchmarks to characterize when retrieval reduces accuracy, and proposes two methods to mitigate the problem:

1. Using a Natural Language Inference (NLI) model to filter out retrieved passages that do not entail the question-answer pair. This prevents performance degradation, but it also discards some relevant passages.
2. Automatically generating data to fine-tune the language model so that it uses retrieved passages correctly, including in challenging multi-hop tasks, by mixing relevant and irrelevant contexts during training. Experiments show that even 1,000 examples are sufficient to make the model robust to irrelevant contexts while maintaining high performance on relevant ones.

In summary, the main contributions of the paper are:

- An in-depth analysis of the robustness of RALMs to irrelevant retrieved contexts.
- Showing that small NLI models can identify irrelevant contexts and improve robustness without updating model parameters.
- Showing that training LLMs to learn when to use retrieval helps the model ignore irrelevant contexts and improves overall performance, especially in complex multi-hop tasks.
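The two mitigation ideas can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a hypothetical `scorer` callable that returns an entailment probability for a (premise, hypothesis) pair, a hypothetical `threshold` of 0.5, a simple way of turning a question-answer pair into a hypothesis string, and hypothetical dictionary field names such as `relevant_passage`. The `irrelevant_fraction` mixing ratio is likewise an illustrative parameter, not a value from the paper.

```python
import random
from typing import Callable, Dict, List

# Hypothetical interface: (premise, hypothesis) -> entailment probability in [0, 1].
# In practice this would wrap an NLI model; any callable with this shape works here.
EntailmentScorer = Callable[[str, str], float]


def filter_passages(
    question: str,
    answer: str,
    passages: List[str],
    scorer: EntailmentScorer,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep only passages that the scorer judges to entail the question-answer pair."""
    # Assumed hypothesis format; other templates are possible.
    hypothesis = f"Question: {question} Answer: {answer}"
    return [p for p in passages if scorer(p, hypothesis) >= threshold]


def build_mixed_training_set(
    examples: List[Dict[str, str]],
    irrelevant_pool: List[str],
    irrelevant_fraction: float = 0.5,  # assumed mixing ratio
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Pair each example with either its relevant passage or a random irrelevant one.

    The training target is always the gold answer, so a model fine-tuned on this
    data sees cases where it should use the context and cases where it should not.
    """
    rng = random.Random(seed)
    mixed = []
    for ex in examples:
        use_irrelevant = rng.random() < irrelevant_fraction
        context = rng.choice(irrelevant_pool) if use_irrelevant else ex["relevant_passage"]
        mixed.append(
            {
                "context": context,
                "question": ex["question"],
                "target": ex["answer"],
                "context_is_relevant": not use_irrelevant,
            }
        )
    return mixed
```

The filter changes only what the model sees at inference time, which matches the summary's point that it needs no parameter updates. The mixed training set instead changes the model itself, which is why it can keep relevant passages that a strict filter would discard.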