Abstract: Retrieval-augmented language models (RALMs) hold promise to produce language understanding systems that are factual, efficient, and up-to-date. An important desideratum of RALMs is that retrieved information helps model performance when it is relevant, and does not harm performance when it is not. This is particularly important in multi-hop reasoning scenarios, where misuse of irrelevant evidence can lead to cascading errors. However, recent work has shown that retrieval augmentation can sometimes have a negative effect on performance. In this work, we present a thorough analysis on five open-domain question answering benchmarks, characterizing cases when retrieval reduces accuracy. We then propose two methods to mitigate this issue. First, a simple baseline that filters out retrieved passages that do not entail question-answer pairs according to a natural language inference (NLI) model. This is effective in preventing performance reduction, but at a cost of also discarding relevant passages. Thus, we propose a method for automatically generating data to fine-tune the language model to properly leverage retrieved passages, using a mix of relevant and irrelevant contexts at training time. We empirically show that even 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones.

What problem does this paper attempt to address?

The paper addresses the robustness of retrieval-augmented language models (RALMs) to irrelevant retrieved information. Specifically, the authors ask how to ensure that retrieval improves model performance when the retrieved passages are relevant and does not degrade it when they are not. This matters most in multi-hop reasoning, where misusing irrelevant evidence can lead to cascading errors. By analyzing five open-domain question-answering benchmarks, the paper characterizes the cases in which retrieval reduces accuracy and proposes two methods to mitigate the issue:

1. Using a natural language inference (NLI) model to filter out retrieved passages that do not entail the question-answer pair. This effectively prevents performance degradation, but it also discards some relevant passages.
2. Automatically generating data to fine-tune the language model so that it uses retrieved passages correctly, including in challenging multi-hop tasks, by mixing relevant and irrelevant contexts during training.

Experiments show that even 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones.

In summary, the main contributions of the paper are:

- An in-depth analysis of the robustness of RALMs under irrelevant retrieved contexts.
- Demonstrating that small NLI models can be used to identify irrelevant contexts and improve robustness without updating model parameters.
- Showing that training LLMs to learn when to use retrieval can help the model ignore irrelevant contexts and improve overall performance, especially in complex multi-hop tasks.
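The two mitigation ideas summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `entail_prob` scorer, the hypothesis template, the `threshold` value, the example dictionary keys, and the `irrelevant_fraction` parameter are all hypothetical choices made here for illustration. In practice `entail_prob` would wrap an NLI model of your choice; here it is simply any callable returning an entailment probability.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer type: (premise, hypothesis) -> probability of entailment.
EntailFn = Callable[[str, str], float]


def filter_passages(
    question: str,
    answer: str,
    passages: Sequence[str],
    entail_prob: EntailFn,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Keep only passages the scorer judges to entail the question-answer pair."""
    # Assumed hypothesis template; the paper's exact format is not given here.
    hypothesis = f"Question: {question} Answer: {answer}"
    return [p for p in passages if entail_prob(p, hypothesis) >= threshold]


def build_training_mix(
    examples: Sequence[Dict[str, str]],
    irrelevant_pool: Sequence[str],
    irrelevant_fraction: float = 0.5,  # assumed mixing ratio
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Pair each question with either its own context or a random irrelevant one."""
    rng = random.Random(seed)
    mixed: List[Dict[str, object]] = []
    for ex in examples:
        use_irrelevant = rng.random() < irrelevant_fraction
        context = rng.choice(list(irrelevant_pool)) if use_irrelevant else ex["context"]
        mixed.append(
            {
                "question": ex["question"],
                "context": context,
                # The target answer stays the same, so the model is trained to
                # answer correctly whether or not the context helps.
                "answer": ex["answer"],
                "context_is_relevant": not use_irrelevant,
            }
        )
    return mixed
```

The filter makes the trade-off noted in the summary visible: a stricter `threshold` removes more irrelevant passages but also risks discarding relevant ones, which is the motivation for training on mixed contexts instead.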

Making Retrieval-Augmented Language Models Robust to Irrelevant Context

In-Context Retrieval-Augmented Language Models

Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning

Assessing "Implicit" Retrieval Robustness of Large Language Models

Improving Retrieval Augmented Language Model with Self-Reasoning

Toward Robust RALMs: Revealing the Impact of Imperfect Retrieval on Retrieval-Augmented Language Models

Unraveling and Mitigating Retriever Inconsistencies in Retrieval-Augmented Large Language Models

Sufficient Context: A New Lens on Retrieval Augmented Generation Systems

More Room for Language: Investigating the Effect of Retrieval on Language Models

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems

Context Tuning for Retrieval Augmented Generation

ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering

RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing

Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

Better RAG using Relevant Information Gain

Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

REALM: Retrieval-Augmented Language Model Pre-Training

Reimagining Retrieval Augmented Language Models for Answering Queries

Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA