ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering

Huayang Li, Pat Verga, Priyanka Sen, Bowen Yang, Vijay Viswanathan, Patrick Lewis, Taro Watanabe, Yixuan Su

2024-10-04

Abstract: The context window of large language models (LLMs) has been extended significantly in recent years. However, while the context length that the LLM can process has grown, the capability of the model to accurately reason over that context degrades noticeably. This occurs because modern LLMs often become overwhelmed by the vast amount of information in the context; when answering questions, the model must identify and reason over relevant evidence sparsely distributed throughout the text. To alleviate the challenge of long-context reasoning, we develop a retrieve-then-reason framework, enabling LLMs to reason over relevant evidence collected during an intermediate retrieval step. We find that modern LLMs struggle to accurately retrieve relevant facts and instead, often hallucinate "retrieved facts", resulting in flawed reasoning and the production of incorrect answers. To address these issues, we introduce ALR$^2$, a method that augments the long-context reasoning capability of LLMs via an explicit two-stage procedure, i.e., aligning LLMs with the objectives of both retrieval and reasoning. We demonstrate the efficacy of ALR$^2$ for mitigating performance degradation in long-context reasoning tasks. Through extensive experiments on long-context QA benchmarks, we find our method to outperform competitive baselines by large margins, achieving at least 8.4 and 7.9 EM gains on the long-context versions of HotpotQA and SQuAD datasets, respectively.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the degradation in reasoning performance that large language models (LLMs) show on long contexts. Although the context window of LLMs has expanded significantly in recent years, their ability to reason accurately over that context declines noticeably as it grows. Modern LLMs are often overwhelmed by the volume of information, which makes it hard to identify and reason over relevant evidence that is sparsely distributed throughout the text. To address this, the authors propose ALR$^2$, a two-stage retrieve-then-reason framework that adds an explicit intermediate retrieval step: the model first retrieves relevant facts from the context and then reasons over those retrieved facts to produce the final answer. The study finds that modern LLMs, when asked to retrieve facts directly from a long context, often hallucinate "retrieved facts", which leads to flawed reasoning and incorrect answers. ALR$^2$ mitigates this by aligning the LLM with both the retrieval and the reasoning objectives. On the long-context versions of HotpotQA and SQuAD, it outperforms competitive baselines by at least 8.4 and 7.9 EM, respectively. The summary also reports that ALR$^2$ generalizes well to unseen datasets.
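The two-stage control flow described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: ALR$^2$ aligns the model with both objectives through training, whereas the code below uses plain prompting. The names `retrieve_then_reason`, `filter_grounded_facts`, `toy_generate`, and the prompt wording are all hypothetical, and the verbatim-match filter is an assumption added here as one simple way to discard hallucinated "retrieved facts".

```python
from typing import Callable, List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores layout."""
    return " ".join(text.lower().split())


def filter_grounded_facts(facts: List[str], context: str) -> List[str]:
    """Keep only facts that occur verbatim (after normalization) in the context.

    Assumption for this sketch: a "retrieved fact" that cannot be found in
    the context is treated as hallucinated and dropped.
    """
    normalized_context = _normalize(context)
    return [f for f in facts if _normalize(f) and _normalize(f) in normalized_context]


def retrieve_then_reason(
    question: str, context: str, generate: Callable[[str], str]
) -> str:
    """Stage 1: ask the model for evidence. Stage 2: answer from that evidence only."""
    retrieval_prompt = (
        "Task: retrieve\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Copy the sentences needed to answer, one per line."
    )
    raw_facts = generate(retrieval_prompt)
    facts = [line.strip() for line in raw_facts.splitlines() if line.strip()]
    grounded = filter_grounded_facts(facts, context)

    # The reasoning prompt sees the short evidence list, not the long context.
    reasoning_prompt = (
        "Task: reason\n"
        "Facts:\n" + "\n".join(grounded) + "\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return generate(reasoning_prompt).strip()


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, used only to exercise the flow."""
    if prompt.startswith("Task: retrieve"):
        # The second line is deliberately not in the context (a hallucination).
        return "Paris is the capital of France.\nThe Eiffel Tower is in Berlin."
    return " Paris "
```

In practice `generate` would wrap a call to whatever model is being evaluated; the stub only shows that the second stage receives the filtered evidence rather than the full context.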

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

Retrieval-Augmented Chain-of-Thought in Semi-structured Domains

PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter

Retrieval meets Long Context Large Language Models

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

CoQ: An Empirical Framework for Multi-hop Question Answering Empowered by Large Language Models

Large Language Models Can Self-Improve in Long-context Reasoning

Investigating Answerability of LLMs for Long-Form Question Answering

Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

Enhancing Large Language Models' Situated Faithfulness to External Contexts

Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models

Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering

Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation

Improving Retrieval Augmented Language Model with Self-Reasoning

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style