Abstract: Long text generation, such as novel writing and discourse-level translation with extremely long contexts, presents significant challenges to current language models. Existing methods mainly focus on extending the model's context window through strategies like length extrapolation. However, these approaches demand substantial hardware resources during the training and/or inference phases. Our proposed method, Temp-Lora, introduces an alternative concept. Instead of relying on the KV cache to store all context information, we embed this information directly into a temporary Lora module. During long text generation, this module is progressively trained on the previously generated text. This approach not only preserves contextual knowledge efficiently but also avoids any permanent alteration to the model's parameters, since the module is discarded after generation. Extensive experiments on the PG19 language modeling benchmark and the GuoFeng discourse-level translation benchmark validate the effectiveness of Temp-Lora. Our results show that: 1) Temp-Lora substantially enhances generation quality for long text, as indicated by a 13.2% decrease in perplexity (PPL) on a subset of PG19, and a 29.3% decrease in PPL along with a 113.2% increase in BLEU score on a subset of GuoFeng; 2) Temp-Lora is compatible with and enhances most existing long text generation methods; and 3) Temp-Lora can greatly reduce computational costs by shortening the context window. For example, we can ensure a moderate improvement in generation quality (a decrease of 3.8% in PPL) while enabling a 51.5% reduction in memory usage and a 60.0% decrease in inference latency.
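
The abstract reports its results as relative changes in perplexity. As a reminder of what that measures, the sketch below computes perplexity as the exponential of the mean negative log-probability per token and then the percentage decrease between two systems. The function names and the probabilities are made up for illustration and are not taken from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative natural-log probability per token)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_decrease(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0

# Illustrative (made-up) per-token probabilities for two hypothetical systems.
baseline_log_probs = [math.log(0.25)] * 4   # each token predicted with p = 0.25
improved_log_probs = [math.log(0.5)] * 4    # each token predicted with p = 0.5

baseline_ppl = perplexity(baseline_log_probs)
improved_ppl = perplexity(improved_log_probs)
print(baseline_ppl, improved_ppl, relative_decrease(baseline_ppl, improved_ppl))
```

Lower perplexity means the model assigns higher probability to the reference text, so a reported "decrease in PPL" is an improvement.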

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the challenges language models face in long text generation, such as novel writing and discourse-level translation. Existing methods primarily extend the model's context window, which requires substantial hardware resources during training and inference.

The paper proposes a method called **Temp-Lora**, which embeds context information directly into a temporary Lora module instead of relying on the KV cache to store all of it. During long text generation, this module is trained incrementally on the previously generated text. The module is discarded after generation, so contextual knowledge is retained efficiently without permanently changing the model's parameters.

### Main Contributions

1. **Improved Generation Quality**: Temp-Lora improves the quality of long text generation on both benchmarks. On a subset of PG19, perplexity (PPL) is reduced by 13.2%; on a subset of GuoFeng, PPL is reduced by 29.3% and the BLEU score increases by 113.2%.
2. **Compatibility with Existing Methods**: Temp-Lora can be combined with most existing long text generation methods to further enhance their performance.
3. **Reduced Computational Cost**: By shortening the context window, Temp-Lora can greatly reduce memory usage and inference latency. For instance, while still giving a moderate improvement in generation quality (PPL reduced by 3.8%), it achieves a 51.5% reduction in memory usage and a 60.0% reduction in inference latency.

### Experimental Validation

The paper validates the method on two benchmark datasets:

1. **PG19**: A long text language modeling benchmark containing over 28,000 books published before 1919. Experimental results show that Temp-Lora reduces perplexity across text segments of different lengths.
2. **GuoFeng**: A subset randomly selected from WMT 2023, containing 20 web novels written in Chinese by different novelists and translated into English by professional translators. Experimental results indicate that Temp-Lora also performs well on discourse-level literary translation, improving both BLEU and COMET scores.

### Conclusion

Temp-Lora addresses the problem of storing context information in long text generation by dynamically updating a temporary Lora module during the generation process. Experimental results validate the method's effectiveness in improving generation quality and reducing computational cost. The method applies not only to long text language modeling but also to downstream tasks such as discourse-level literary translation.
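
The generate-then-train loop described above can be sketched in a few lines. This is only a control-flow illustration under stated assumptions, not the authors' implementation: `TempAdapter`, `generate_long_text`, `toy_generate_chunk`, and the chunk and window sizes are all hypothetical names and values, and the "training" step merely records chunks instead of updating low-rank weights.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class TempAdapter:
    """Hypothetical stand-in for a temporary Lora module.

    A real adapter would hold low-rank weight updates; this one only
    records the chunks it was "trained" on so the loop can be run.
    """
    trained_chunks: List[List[int]] = field(default_factory=list)

    def train_on(self, context: List[int], chunk: List[int]) -> None:
        # Placeholder for gradient steps on `chunk` conditioned on `context`.
        self.trained_chunks.append(list(chunk))

def generate_long_text(
    prompt: List[int],
    generate_chunk: Callable[[List[int], TempAdapter], List[int]],
    num_chunks: int,
    window: int,
) -> Tuple[List[int], int]:
    """Generate chunk by chunk, updating a temporary adapter after each chunk."""
    adapter = TempAdapter()            # created fresh for this generation only
    history = list(prompt)
    for _ in range(num_chunks):
        context = history[-window:]    # short window (window > 0), not the full history
        chunk = generate_chunk(context, adapter)
        adapter.train_on(context, chunk)   # older text now lives in the adapter
        history.extend(chunk)
    updates = len(adapter.trained_chunks)
    del adapter                        # discarded: the base model is unchanged
    return history[len(prompt):], updates

def toy_generate_chunk(context: List[int], adapter: TempAdapter) -> List[int]:
    """Toy generator (assumption): continues counting from the last token."""
    start = context[-1] + 1
    return [start, start + 1]

tokens, updates = generate_long_text([0], toy_generate_chunk, num_chunks=3, window=2)
print(tokens, updates)
```

The point of the design is that the prompt window stays short on every step, while information about earlier chunks is carried by the adapter rather than by a growing KV cache.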

With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation

MemLong: Memory-Augmented Retrieval for Long Text Modeling

Training With "Paraphrasing the Original Text" Improves Long-Context Performance

A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

How to Train Long-Context Language Models (Effectively)

Language Models can Self-Lengthen to Generate Long Texts

Empower Your Model with Longer and Better Context Comprehension

Retrieval meets Long Context Large Language Models

Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding

Open-ended Long Text Generation via Masked Language Modeling

Why Does the Effective Context Length of LLMs Fall Short?

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

Extending Context Window of Large Language Models via Semantic Compression

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Fixed global memory for controllable long text generation

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Augmenting Language Models with Long-Term Memory