ACER: Automatic Language Model Context Extension via Retrieval

Luyu Gao, Yunyi Zhang, Jamie Callan
2024-10-12
Abstract: Long-context modeling is one of the critical capabilities of language AI for digesting and reasoning over complex information pieces. In practice, long-context capabilities are typically built into a pre-trained language model (LM) through a carefully designed context extension stage, with the goal of producing generalist long-context capabilities. In our preliminary experiments, however, we discovered that the current open-weight generalist long-context models are still lacking in practical long-context processing tasks. While this means perfectly effective long-context modeling demands task-specific data, the cost can be prohibitive. In this paper, we draw inspiration from how humans process a large body of information: a lossy **retrieval** stage ranks a large set of documents while the reader ends up reading deeply only the top candidates. We build an **automatic** data synthesis pipeline that mimics this process using short-context LMs. The short-context LMs are further tuned using these self-generated data to obtain task-specific long-context capabilities. Similar to how pre-training learns from imperfect data, we hypothesize and further demonstrate that the short-context model can bootstrap over the synthetic data, outperforming not only long-context generalist models but also the retrieval and read pipeline used to synthesize the training data in real-world tasks such as long-context retrieval augmented generation.
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
The paper addresses the weak performance of current open-weight generalist long-context language models on practical long-context processing tasks. Long-context modeling is crucial for handling complex information, but fully effective long-context modeling requires task-specific data, and collecting that data can be prohibitively expensive.

To work around this cost, the authors draw inspiration from how humans handle large amounts of information and propose ACER (Automatic Language Model Context Extension via Retrieval). The method synthesizes training data with a retriever and a short-context language model, then fine-tunes the short-context model on that data so it acquires task-specific long-context capabilities. ACER has two main stages:

1. **Automatic Data Synthesis**: The long context is split into text chunks, a retrieval model scores and ranks the chunks, and the top-ranked chunks are fed to a short-context generation model, which produces an answer with Chain-of-Thought (CoT) reasoning.
2. **Self-Training**: The synthesized data is used to fine-tune the model. During training, the model learns to extract the useful information from the complete context and generate a reasoned answer.

In this way, ACER produces long-context training data automatically, without task-specific human labels. In the authors' experiments, the resulting model outperforms existing generalist long-context models, and also the retrieve-and-read pipeline used to synthesize its training data, particularly on tasks such as long-context retrieval-augmented generation (RAG).
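The data synthesis stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the word-overlap scorer (a toy stand-in for a real retrieval model), the prompt wording, and the `generate` callable (a placeholder for a short-context LM) are all assumptions made for illustration. The sketch also assumes that each training example pairs the full long context with the answer generated from the top-ranked chunks, which is one reading of the self-training description.

```python
from collections import Counter
from typing import Callable, Dict, List


def split_into_chunks(text: str, chunk_words: int) -> List[str]:
    """Split a long context into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: how often the query's terms occur in the chunk.

    Hypothetical stand-in for a real retrieval model.
    """
    query_terms = set(query.lower().split())
    chunk_counts = Counter(chunk.lower().split())
    return float(sum(chunk_counts[term] for term in query_terms))


def retrieve_top_k(
    query: str,
    chunks: List[str],
    k: int,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[str]:
    """Rank chunks by relevance to the query and keep the k best."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:k]


def synthesize_example(
    query: str,
    long_context: str,
    generate: Callable[[str], str],
    chunk_words: int = 8,
    k: int = 2,
) -> Dict[str, str]:
    """Build one synthetic training example.

    The short-context generator only sees the top-ranked chunks, while the
    returned example pairs the FULL context with the generated answer
    (an assumption of this sketch).
    """
    chunks = split_into_chunks(long_context, chunk_words)
    top_chunks = retrieve_top_k(query, chunks, k)
    # Hypothetical prompt format asking for step-by-step (CoT) reasoning.
    short_prompt = (
        "\n\n".join(top_chunks)
        + f"\n\nQuestion: {query}\nThink step by step, then answer."
    )
    answer = generate(short_prompt)
    return {"input": f"{long_context}\n\nQuestion: {query}", "target": answer}
```

To try it, pass any function that maps a prompt string to an answer string as `generate`, for example a thin wrapper around a short-context model. The examples returned by `synthesize_example` would then serve as fine-tuning data in the self-training stage.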