Abstract: Standard Large Language Models (LLMs) struggle with handling dialogues with long contexts due to efficiency and consistency issues. According to our observation, dialogue contexts are highly structured, and the special token of End-of-Utterance (EoU) in dialogues has the potential to aggregate information. We refer to the EoU tokens as "conversational attention sinks" (conv-attn sinks). Accordingly, we introduce StreamingDialogue, which compresses long dialogue history into conv-attn sinks with minimal losses, and thus reduces computational complexity quadratically with the number of sinks (i.e., the number of utterances). Current LLMs already demonstrate the ability to handle long context windows, e.g., a window size of 200K or more. To this end, by compressing utterances into EoUs, our method has the potential to handle more than 200K utterances, enabling prolonged dialogue learning. In order to minimize information losses from reconstruction after compression, we design two learning strategies: short-memory reconstruction (SMR) and long-memory reactivation (LMR). Our method outperforms strong baselines in dialogue tasks and achieves a 4× speedup while reducing memory usage by 18× compared to dense attention recomputation.

What problem does this paper attempt to address?

This paper addresses the efficiency and consistency problems that large language models (LLMs) encounter when handling long dialogue contexts. Specifically, standard LLMs face the following challenges with long dialogues:

1. **High computational complexity**: The cost of the attention mechanism grows quadratically with text length, which increases GPU memory usage and slows down generation.
2. **Context window limitation**: Once the context length exceeds the preset limit (for example, 4,096 tokens for LLaMA2), the model's inference ability declines sharply.
3. **Insufficient long-term memory**: Existing methods such as StreamingLLM support long conversations by introducing "attention sinks", but they gradually lose historical information as the dialogue context grows, which hurts dialogue consistency and the user experience.

To address these problems, the paper proposes the **StreamingDialogue** method. Its main contributions are:

1. **Discovering and utilizing "conv-attn sinks"**: The authors observe that the tokens used to separate utterances (such as End-of-Utterance, EoU) attract more attention than other tokens, so they define these tokens as "conv-attn sinks". Caching only these sinks and their associated key-value pairs significantly reduces computational complexity and memory consumption.
2. **Proposing two learning strategies**: Short-Memory Reconstruction (SMR) and Long-Memory Reactivation (LMR) strengthen the model's information aggregation ability and long-term memory, respectively.
3. **Experimental verification**: The experiments show that StreamingDialogue outperforms other sparse-attention and memory-augmented methods in dialogue tasks across multiple evaluation metrics (Perplexity, BLEU, ROUGE, Distinct, USL-H and Dial-M).

The method is also efficient, achieving a 4-fold speedup and an 18-fold reduction in memory usage compared to dense-attention recomputation. With these contributions, StreamingDialogue handles long dialogue contexts efficiently while maintaining dialogue consistency and fluency, which supports prolonged dialogue learning.
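
The idea of attending only to conv-attn sinks plus local context can be illustrated with a small attention-mask sketch. This is a simplified illustration, not the paper's exact attention pattern: it assumes each token may attend to every earlier EoU token and to the earlier tokens of its own utterance, and it ignores any additional positions the actual method may keep. The function names and the `<eou>` marker string are hypothetical.

```python
def conv_sink_mask(tokens, eou="<eou>"):
    """Build a causal mask where mask[i][j] is True if query i may attend to key j.

    Assumption (illustrative only): a token sees earlier EoU tokens
    (the conv-attn sinks) and earlier tokens of its own utterance.
    """
    # Assign each token an utterance index; an EoU token closes its utterance.
    utt_id = []
    current = 0
    for tok in tokens:
        utt_id.append(current)
        if tok == eou:
            current += 1

    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions up to i
            same_utterance = utt_id[j] == utt_id[i]
            is_sink = tokens[j] == eou
            mask[i][j] = same_utterance or is_sink
    return mask


def count_attended(mask):
    """Count the (query, key) pairs that are allowed by the mask."""
    return sum(sum(row) for row in mask)


tokens = ["hi", "there", "<eou>", "hello", "<eou>", "how", "are", "you", "<eou>"]
mask = conv_sink_mask(tokens)
dense_pairs = len(tokens) * (len(tokens) + 1) // 2  # full causal attention
sparse_pairs = count_attended(mask)
print(dense_pairs, sparse_pairs)
```

In this toy dialogue the sparse mask keeps 29 of the 45 causal query-key pairs. The gap widens as the dialogue grows, because each earlier utterance contributes only its single EoU position to what later tokens attend to.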

StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses

Efficient Streaming Language Models with Attention Sinks

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Extending Context Window of Large Language Models via Semantic Compression

LanguaShrink: Reducing Token Overhead with Psycholinguistics

Efficient Streaming LLM for Speech Recognition

Perception Compressor: A training-free prompt compression method in long context scenarios

Evaluating Very Long-Term Conversational Memory of LLM Agents

Adapting LLMs for Efficient Context Processing through Soft Prompt Compression

An Exploratory Study on Long Dialogue Summarization: What Works and What's Next

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models

DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization

Recurrent Context Compression: Efficiently Expanding the Context Window of LLM

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Enabling Real-Time Conversations with Minimal Training Costs

E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations

Effective and Efficient Conversation Retrieval for Dialogue State Tracking with Implicit Text Summaries