How to Train Long-Context Language Models (Effectively)

Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen

2024-10-04

Abstract: We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development: instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context tasks, and we evaluate models after SFT with instruction data, as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite having seen only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses how to train long-context language models effectively. Specifically, the authors pursue four goals:

1. **Establish a reliable evaluation protocol**: Perplexity and simple "needle-in-a-haystack" (NIAH) tests are insufficient to guide model development. The authors instead evaluate on a broad set of long-context tasks, including Retrieval-Augmented Generation (RAG), long-document summarization, and In-Context Learning (ICL) with many examples, and they evaluate models after SFT because this better reveals long-context abilities.
2. **Optimize data engineering**: Training on long data alone can harm model performance, while mixing long data with high-quality short data improves performance on long-context tasks. The best long-data sources are code repositories and books, combined with high-quality short data.
3. **Expand data scale and sequence length**: Performance improves with more training data and longer training sequences. Experiments show that training with a sequence length beyond the evaluation length helps on long-context tasks.
4. **Supervised Fine-Tuning (SFT)**: Using only short instruction datasets for SFT yields strong long-context performance, while synthesized long instruction data brings no significant improvement.

From these studies, the authors derive their final model, ProLong-8B, which achieves state-of-the-art long-context performance among similarly sized models at a 128K context length and can effectively process contexts of up to 512K tokens.
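The data-mixing idea in the second point can be illustrated with a minimal sketch. It assumes a simple per-document weighted sampler; the function name `sample_mixture`, the toy corpora, and the weights are all hypothetical placeholders chosen for illustration, not the ratios or pipeline used by the authors.

```python
import random
from collections import Counter


def sample_mixture(sources, weights, n_docs, seed=0):
    """Draw n_docs documents, choosing the source of each draw in
    proportion to its mixture weight (hypothetical helper)."""
    if set(sources) != set(weights):
        raise ValueError("sources and weights must have the same keys")
    rng = random.Random(seed)
    names = sorted(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n_docs):
        # Pick a source by weight, then a document uniformly within it.
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append((name, rng.choice(sources[name])))
    return batch


# Toy stand-ins for real corpora (assumption: documents are plain strings).
sources = {
    "code_repos": ["repo_a", "repo_b"],
    "books": ["book_a", "book_b"],
    "short_high_quality": ["short_a", "short_b", "short_c"],
}

# Placeholder weights for illustration only; not taken from the paper.
weights = {"code_repos": 0.3, "books": 0.3, "short_high_quality": 0.4}

batch = sample_mixture(sources, weights, n_docs=1000)
counts = Counter(name for name, _ in batch)
print(counts)
```

Sampling per document keeps the sketch short; a real training pipeline would more likely mix by token count and pack documents into fixed-length sequences, which this example does not attempt.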


LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models

A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts

LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models

A Controlled Study on Long Context Extension and Generalization in LLMs

Training-Free Long-Context Scaling of Large Language Models

LongAlign: A Recipe for Long Context Alignment of Large Language Models

Training With "Paraphrasing the Original Text" Improves Long-Context Performance

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models

Empower Your Model with Longer and Better Context Comprehension

Long-context LLMs Struggle with Long In-context Learning

LongReward: Improving Long-context Large Language Models with AI Feedback

Retrieval meets Long Context Large Language Models

Long-Context Language Modeling with Parallel Context Encoding

LooGLE: Can Long-Context Language Models Understand Long Contexts?

Correlation-Aware Select and Merge Attention for Efficient Fine-Tuning and Context Length Extension

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Why Does the Effective Context Length of LLMs Fall Short?

Do Long-Range Language Models Actually Use Long-Range Context?