With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation

Y. Wang, D. Ma, D. Cai
2024-09-11
Abstract: Long text generation, such as novel writing and discourse-level translation with extremely long contexts, presents significant challenges to current language models. Existing methods mainly focus on extending the model's context window through strategies like length extrapolation. However, these approaches demand substantial hardware resources during the training and/or inference phases. Our proposed method, Temp-Lora, introduces an alternative concept. Instead of relying on the KV cache to store all context information, we embed this information directly into a temporary Lora module. In the process of long text generation, this module is progressively trained with text generated previously. This approach not only efficiently preserves contextual knowledge but also prevents any permanent alteration to the model's parameters, given that the module is discarded post-generation. Extensive experiments on the PG19 language modeling benchmark and the GuoFeng discourse-level translation benchmark validate the effectiveness of Temp-Lora. Our results show that: 1) Temp-Lora substantially enhances generation quality for long text, as indicated by a 13.2% decrease in perplexity (PPL) on a subset of PG19, and a 29.3% decrease in PPL along with a 113.2% increase in BLEU score on a subset of GuoFeng; 2) Temp-Lora is compatible with and enhances most existing long text generation methods; and 3) Temp-Lora can greatly reduce computational costs by shortening the context window. For example, we can ensure a moderate improvement in generation quality (a decrease of 3.8% in PPL) while enabling a 51.5% memory usage reduction and a 60.0% decrease in latency for inference.
Computation and Language, Artificial Intelligence
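To make the mechanism in the abstract concrete, here is a minimal, self-contained sketch under heavy simplifying assumptions: a single frozen linear layer stands in for the language model, vector (input, target) pairs stand in for tokens, and squared error stands in for the language-modeling loss. The names `TempLoraLinear`, `temp_lora_stream`, and `make_chunks`, and every hyperparameter (rank, alpha, learning rate, epochs, chunk size), are hypothetical and not taken from the paper. The sketch only illustrates the three properties the abstract emphasizes: the base weights are never updated, a low-rank adapter is trained progressively on chunks that have already been processed, and the adapter is thrown away at the end.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class TempLoraLinear:
    """Frozen linear layer W plus a temporary low-rank adapter:
    y = W x + (alpha / rank) * B (A x). Only A and B are ever trained."""

    def __init__(self, weight, rank=2, alpha=2.0, seed=0):
        self.weight = weight  # frozen base weights: never modified below
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        # LoRA-style init: small random A, zero B, so the adapter starts as a no-op.
        self.a = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        if self.a is None:  # adapter discarded: behave exactly like the base layer
            return base
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def train_step(self, x, target, lr):
        """One SGD step on 0.5 * ||y - target||^2 with respect to A and B only."""
        ax = matvec(self.a, x)
        err = [y - t for y, t in zip(self.forward(x), target)]
        # Gradient w.r.t. A uses B^T err, computed before B is updated.
        bt_err = [sum(self.b[i][k] * e for i, e in enumerate(err)) for k in range(len(ax))]
        for i, e in enumerate(err):  # dL/dB[i][k] = scale * err[i] * (A x)[k]
            for k, v in enumerate(ax):
                self.b[i][k] -= lr * self.scale * e * v
        for k, g in enumerate(bt_err):  # dL/dA[k][j] = scale * (B^T err)[k] * x[j]
            for j, xj in enumerate(x):
                self.a[k][j] -= lr * self.scale * g * xj

    def discard(self):
        """Drop the temporary adapter once the document is finished."""
        self.a = self.b = None


def chunk_loss(layer, chunk):
    """Mean squared-error loss of the layer on one chunk of (input, target) pairs."""
    total = 0.0
    for x, target in chunk:
        total += 0.5 * sum((y - t) ** 2 for y, t in zip(layer.forward(x), target))
    return total / len(chunk)


def temp_lora_stream(layer, chunks, epochs=20, lr=0.05):
    """Process chunks in order: score each chunk with the current adapter, then
    train the adapter on that chunk so later chunks benefit. Discard at the end."""
    losses = []
    for chunk in chunks:
        losses.append(chunk_loss(layer, chunk))
        for _ in range(epochs):
            for x, target in chunk:
                layer.train_step(x, target, lr)
    layer.discard()
    return losses


def make_chunks(n_chunks=5, chunk_size=10, seed=1):
    """Hypothetical 'document': targets follow the identity map plus a rank-1 shift."""
    rng = random.Random(seed)
    chunks = []
    for _ in range(n_chunks):
        chunk = []
        for _ in range(chunk_size):
            x = [rng.gauss(0.0, 1.0) for _ in range(3)]
            chunk.append((x, [x[0] + x[1], x[1], x[2]]))
        chunks.append(chunk)
    return chunks


base_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
layer = TempLoraLinear(base_w, rank=2, alpha=2.0)
losses = temp_lora_stream(layer, make_chunks())
print([round(l, 4) for l in losses])  # per-chunk loss, scored before training on it
```

Because `B` starts at zero, the adapter initially leaves the layer's output unchanged. Each chunk is scored before the adapter trains on it, so falling per-chunk loss reflects knowledge carried forward from earlier chunks, and `discard()` restores the untouched base layer. The paper applies its module to a full language model trained on previously generated text; this toy only mirrors that control flow.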
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the significant challenges that language models face in long text generation, such as novel writing and discourse-level translation. Existing methods mainly tackle these challenges by extending the model's context window, but they require substantial hardware resources during training and inference. The paper instead proposes **Temp-Lora**, which embeds context information directly into a temporary Lora module rather than relying on the KV cache to store all of it. During long text generation, this module is trained incrementally on previously generated text. The method retains contextual knowledge efficiently and, because the module is discarded after generation, makes no permanent change to the model's parameters.

### Main Contributions

1. **Improved Generation Quality**: Temp-Lora significantly improves long text generation quality on both benchmarks. On a subset of PG19, perplexity (PPL) drops by 13.2%; on a subset of GuoFeng, PPL drops by 29.3% and the BLEU score rises by 113.2%.
2. **Compatibility with Existing Methods**: Temp-Lora can be combined with most existing long text generation methods to further improve their performance.
3. **Reduced Computational Cost**: Because it allows a shorter context window, Temp-Lora can substantially reduce memory usage and inference latency. For instance, while still delivering a moderate quality improvement (a 3.8% decrease in PPL), it reduces memory usage by 51.5% and inference latency by 60.0%.

### Experimental Validation

The paper conducts extensive experiments on two benchmark datasets:

1. **PG19**: a long text language modeling benchmark containing over 28,000 books published before 1919.
   Experimental results show that Temp-Lora significantly reduces perplexity across text segments of different lengths.
2. **GuoFeng**: a subset randomly selected from WMT 2023, containing 20 web novels written in Chinese by different novelists and translated into English by professional translators. Experimental results indicate that Temp-Lora also performs strongly on discourse-level literary translation, significantly improving BLEU and COMET scores.

### Conclusion

Temp-Lora addresses the problem of storing context information in long text generation by dynamically updating a temporary Lora module during generation. The experimental results confirm that it improves generation quality and reduces computational cost. Beyond long-text language modeling, the method also applies to downstream tasks such as discourse-level literary translation.
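The quality gains above are reported as relative changes in perplexity and BLEU. As a small worked example (all helper names are hypothetical), the sketch below computes perplexity as the exponential of the mean per-token negative log-likelihood (NLL) and converts a percentage drop in PPL into the implied drop in average NLL. Since PPL = exp(mean NLL), a 13.2% decrease means PPL_new = 0.868 × PPL_old, which corresponds to a reduction of -ln(0.868) ≈ 0.1416 nats per token. It reproduces none of the paper's measurements; it only unpacks the reported percentages.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_change_pct(baseline, new):
    """Signed relative change in percent (negative = decrease)."""
    return (new - baseline) / baseline * 100.0


def nll_drop_for_ppl_decrease(pct_decrease):
    """Mean per-token NLL reduction (nats) implied by a given % drop in perplexity."""
    return -math.log(1.0 - pct_decrease / 100.0)


# A 13.2% lower PPL means the new PPL is 0.868x the baseline,
# i.e. about 0.1416 fewer nats per token on average.
print(round(nll_drop_for_ppl_decrease(13.2), 4))  # 0.1416
# A 113.2% BLEU increase means the new score is 2.132x the baseline.
print(round(1.0 + 113.2 / 100.0, 3))  # 2.132
```

Working in log space makes it clear that modest-sounding PPL percentages correspond to small but consistent per-token likelihood gains accumulated over very long texts.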