Abstract: Long-context modeling is one of the critical capabilities of language AI for digesting and reasoning over complex information pieces. In practice, long-context capabilities are typically built into a pre-trained language model (LM) through a carefully designed context extension stage, with the goal of producing generalist long-context capabilities. In our preliminary experiments, however, we discovered that the current open-weight generalist long-context models are still lacking in practical long-context processing tasks. While this means perfectly effective long-context modeling demands task-specific data, the cost can be prohibitive. In this paper, we draw inspiration from how humans process a large body of information: a lossy **retrieval** stage ranks a large set of documents while the reader ends up reading deeply only the top candidates. We build an **automatic** data synthesis pipeline that mimics this process using short-context LMs. The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities. Similar to how pre-training learns from imperfect data, we hypothesize and further demonstrate that the short-context model can bootstrap over the synthetic data, outperforming not only long-context generalist models but also the retrieval and read pipeline used to synthesize the training data in real-world tasks such as long-context retrieval augmented generation.

What problem does this paper attempt to address?

The paper addresses the weak performance of current open-weight generalist long-context language models on practical long-context processing tasks. Although long-context modeling is crucial for handling complex information, the authors' preliminary experiments show that these models still fall short in practice. Fully effective long-context modeling appears to require task-specific data, which can be prohibitively costly to obtain.

To solve this problem, the authors draw inspiration from the way humans handle large amounts of information and propose a new method, ACER (Automatic Context Extension via Retrieval). The method synthesizes training data by combining a retrieval model with a short-context language model, then fine-tunes the short-context model on that data so it acquires long-context capabilities for the specific task. The ACER process has two main stages:

1. **Automatic Data Synthesis**: The long context is split into multiple text blocks, a retrieval model scores and ranks these blocks, and the top-ranked blocks are fed into a short-context generation model, which produces an answer with Chain-of-Thought (CoT) reasoning.
2. **Self-Training**: The synthesized data is used to fine-tune the language model. During training, the model sees the complete context and learns to extract the useful information from it and generate a reasoned answer.

In this way, ACER generates long-context training data automatically, without human supervision. In the paper's experiments it outperforms existing generalist long-context models, as well as the retrieve-and-read pipeline used to synthesize the data, particularly on tasks such as long-context retrieval-augmented generation (RAG).
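The data synthesis stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the word-based chunking, the chunk size, and `top_k` are all hypothetical, the word-overlap score is only a toy stand-in for a real retrieval model, and `generate` is an assumed callable that wraps whatever short-context LM is available.

```python
import re
from collections import Counter
from typing import Callable, Dict, List


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split a long context into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def _token_counts(text: str) -> Counter:
    """Lowercased word counts, used by the toy scorer below."""
    return Counter(re.findall(r"\w+", text.lower()))


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: size of the multiset intersection of words.

    Assumption: a real pipeline would call a trained retrieval model here.
    """
    return sum((_token_counts(query) & _token_counts(chunk)).values())


def rank_chunks(query: str, chunks: List[str], top_k: int) -> List[str]:
    """Return the top_k chunks by descending relevance score."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_k]


def synthesize_example(query: str, long_context: str,
                       generate: Callable[[str], str],
                       chunk_words: int = 50, top_k: int = 3) -> Dict[str, str]:
    """Build one synthetic training record.

    The short-context generator only sees the top-ranked chunks, but the
    returned record pairs its answer with the FULL context, so a model
    fine-tuned on it must learn to find the evidence itself.
    """
    chunks = split_into_chunks(long_context, chunk_words)
    top = rank_chunks(query, chunks, top_k)
    prompt = ("\n\n".join(top)
              + f"\n\nQuestion: {query}\nThink step by step, then answer.")
    return {"context": long_context, "question": query,
            "target": generate(prompt)}


if __name__ == "__main__":
    # Stub generator standing in for a short-context LM (hypothetical).
    stub = lambda prompt: f"<answer produced from {len(prompt.split())} prompt words>"
    context = ("The sky is blue. " * 10
               + "The capital of France is Paris. "
               + "Grass is green. " * 10)
    record = synthesize_example("What is the capital of France?", context,
                                stub, chunk_words=8, top_k=2)
    print(record["target"])
```

Records of this shape would then serve as fine-tuning examples in the self-training stage, with the full `context` and `question` as input and `target` as the output to learn. The fine-tuning step itself is omitted here because it depends on the training framework.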

ACER: Automatic Language Model Context Extension via Retrieval
