Abstract: Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention mechanisms and the growing memory consumption of the key-value cache during generation. This work introduces MemLong: Memory-Augmented Retrieval for Long Text Generation, a method designed to enhance the capabilities of long-context language modeling by utilizing an external retriever for historical information retrieval. MemLong combines a non-differentiable "ret-mem" module with a partially trainable decoder-only language model and introduces a fine-grained, controllable retrieval attention mechanism that leverages semantic-level relevant chunks. Comprehensive evaluations on multiple long-context language modeling benchmarks demonstrate that MemLong consistently outperforms other state-of-the-art LLMs. More importantly, MemLong can extend the context length on a single 3090 GPU from 4k up to 80k. Our code is available at https://github.com/Bui1dMySea/MemLong
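To make the key-value cache concern concrete, the following back-of-envelope sketch estimates cache size as a function of context length. The model configuration (32 layers, 32 key-value heads, head dimension 128, 16-bit values) is a hypothetical example chosen for round numbers, not the configuration used by MemLong, and `kv_cache_bytes` is an illustrative helper rather than part of the released code.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a single sequence.

    All default values describe a hypothetical model, not MemLong's.
    """
    # Factor 2: each layer caches one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3
for tokens in (4 * 1024, 80 * 1024):
    size_gib = kv_cache_bytes(tokens) / GIB
    print(f"{tokens:>6} tokens -> {size_gib:.1f} GiB")
# prints:
#   4096 tokens -> 2.0 GiB
#  81920 tokens -> 40.0 GiB
```

Under these assumptions a full cache for an 80k-token context would need about 40 GiB, more than the 24 GB of memory on a single 3090, which is why keeping the entire history in the cache does not scale and an external memory becomes attractive.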

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper "MemLong: Memory-Augmented Retrieval for Long Text Modeling" addresses the challenges that large language models (LLMs) face when processing long texts. Specifically, existing LLMs encounter the following issues with long contexts:

1. **High Time and Space Complexity**: Standard attention mechanisms have time and space complexity that is quadratic in the sequence length, making long texts slow and memory-intensive to process.
2. **High Memory Consumption for Caching**: During generation, the memory consumed by the key-value cache grows with the context length, which can lead to out-of-memory (OOM) errors.
3. **Limited Model Capability**: Some existing methods reduce computational complexity, but often at the cost of model performance.

To address these issues, the authors propose MemLong, a method that enhances long text modeling through an external retriever. The main contributions of MemLong include:

- **Distribution Consistency**: The distribution of information stored in memory remains consistent, avoiding distribution shifts caused by changes in model parameters.
- **Training Efficiency**: Freezing the lower layers of the model and fine-tuning only the upper layers reduces computational cost.
- **Extended Context Window**: The context length can be extended from 4k to 80k on a single 3090 GPU, improving the model's ability to handle long texts.

### Summary

MemLong combines a non-differentiable retrieval module with a partially trainable decoder-only language model and introduces a fine-grained, controllable retrieval attention mechanism that leverages semantically relevant chunks to enhance long text modeling. Experimental results on multiple long-context language modeling benchmarks show that MemLong consistently outperforms other state-of-the-art LLMs.
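The idea of retrieving semantically relevant chunks from an external memory can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real frozen retriever, `ChunkMemory` is a hypothetical class name, and the memory stores raw chunk tokens because this summary does not specify what MemLong actually stores. It is not the authors' implementation.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy stand-in for a frozen retriever: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ChunkMemory:
    """Hypothetical chunk-level memory: write past text, retrieve by similarity."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.entries = []  # list of (embedding, chunk_tokens) pairs

    def write(self, tokens):
        # Split the history into fixed-size chunks and index each one.
        for i in range(0, len(tokens), self.chunk_size):
            chunk = tokens[i:i + self.chunk_size]
            self.entries.append((toy_embed(" ".join(chunk)), chunk))

    def retrieve(self, query_tokens, k=1):
        # Rank stored chunks by similarity to the query and keep the top k.
        q = toy_embed(" ".join(query_tokens))
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


memory = ChunkMemory(chunk_size=4)
memory.write("the cat sat down the stock market fell rain is expected tomorrow".split())
top = memory.retrieve("stock market news".split(), k=1)
print(top)  # [['the', 'stock', 'market', 'fell']]
```

Because the embedding function in this sketch is never updated, the stored index stays valid no matter how the language model is fine-tuned, which is one way to read the distribution-consistency point above. The parameter `k` bounds how many chunks are handed back, so the amount of retrieved context stays fixed as the history grows.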

MemLong: Memory-Augmented Retrieval for Long Text Modeling

Augmenting Language Models with Long-Term Memory

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

Retrieval meets Long Context Large Language Models

UniMem: Towards a Unified View of Long-Context Large Language Models

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models

Needle in the Haystack for Memory Based Large Language Models

LaMemo: Language Modeling with Look-Ahead Memory

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Enhancing Large Language Model with Self-Controlled Memory Framework

A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

RET-LLM: Towards a General Read-Write Memory for Large Language Models

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Visual Context Window Extension: A New Perspective for Long Video Understanding