StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses

Jia-Nan Li, Quan Tu, Cunli Mao, Zhengtao Yu, Ji-Rong Wen, Rui Yan
2024-10-27
Abstract: Standard Large Language Models (LLMs) struggle with dialogues that have long contexts because of efficiency and consistency issues. According to our observation, dialogue contexts are highly structured, and the special End-of-Utterance (EoU) token in dialogues has the potential to aggregate information. We refer to the EoU tokens as "conversational attention sinks" (conv-attn sinks). Accordingly, we introduce StreamingDialogue, which compresses long dialogue history into conv-attn sinks with minimal losses, so that computational complexity grows quadratically with the number of sinks (i.e., the number of utterances) rather than with the number of tokens. Current LLMs already demonstrate the ability to handle long context windows, e.g., a window size of 200K or more. Hence, by compressing utterances into EoUs, our method has the potential to handle more than 200K utterances, enabling prolonged dialogue learning. To minimize information loss from reconstruction after compression, we design two learning strategies: short-memory reconstruction (SMR) and long-memory reactivation (LMR). Our method outperforms strong baselines in dialogue tasks and achieves a 4× speedup while reducing memory usage by 18× compared to dense attention recomputation.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the efficiency and consistency problems that large language models (LLMs) encounter when handling long dialogue contexts. Specifically, standard LLMs face the following challenges:

1. **High computational complexity**: The cost of the attention mechanism grows quadratically with text length, which increases GPU memory usage and slows generation.
2. **Context window limitation**: When the context length exceeds the preset limit (for example, 4,096 tokens), the inference ability of LLaMA2 declines sharply.
3. **Insufficient long-term memory**: Existing methods such as StreamingLLM support long conversations by introducing "attention sinks", but they gradually lose historical information as the dialogue context grows, which harms dialogue consistency and user experience.

To address these problems, the paper proposes the **StreamingDialogue** method. Its main contributions are:

1. **Discovering and utilizing "conv-attn sinks"**: The authors observe that the tokens used to separate utterances in a dialogue (such as End-of-Utterance, EoU) attract more attention than other tokens, so these tokens are defined as "conv-attn sinks". Caching only these sinks and their associated key-value pairs significantly reduces computational complexity and memory consumption.
2. **Proposing two learning strategies**: Short-Memory Reconstruction (SMR) and Long-Memory Reactivation (LMR) strengthen the model's information aggregation ability and long-term memory ability.
3. **Experimental verification**: The experimental results show that StreamingDialogue outperforms other sparse-attention and memory-augmented methods in dialogue tasks across multiple evaluation metrics (Perplexity, BLEU, ROUGE, Distinct, USL-H, and Dial-M).

The method is also efficient, achieving a 4-fold speedup and an 18-fold reduction in memory usage compared to dense attention recomputation. Together, these innovations allow StreamingDialogue to handle long dialogue contexts efficiently while maintaining the consistency and fluency of the dialogue, which supports prolonged dialogue learning.
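
To make the idea of caching only conv-attn sinks concrete, the following is a minimal sketch of a sparse causal attention pattern in which each token attends to the tokens of its own utterance plus the EoU tokens that close earlier utterances. It is an illustration under stated assumptions, not the authors' implementation: the `<eou>` marker string, the function names, and the exact choice of which positions stay visible are all hypothetical simplifications, and the actual method may retain additional positions.

```python
from typing import List

EOU = "<eou>"  # hypothetical end-of-utterance marker, chosen for this sketch


def conv_attn_sink_mask(tokens: List[str], eou: str = EOU) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed pattern (a simplification): causal attention restricted to
    (a) tokens in the same utterance as the query and
    (b) every earlier EoU token, which acts as a conv-attn sink.
    """
    n = len(tokens)

    # Assign each token to an utterance; an EoU token closes its own utterance.
    utt_id = []
    current = 0
    for tok in tokens:
        utt_id.append(current)
        if tok == eou:
            current += 1

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # j <= i keeps the pattern causal
            same_utterance = utt_id[j] == utt_id[i]
            is_sink = tokens[j] == eou
            mask[i][j] = same_utterance or is_sink
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the (query, key) pairs that the mask leaves visible."""
    return sum(sum(row) for row in mask)


# Toy dialogue: two finished utterances and the start of a third.
tokens = ["hi", "there", EOU, "how", "are", "you", EOU, "fine"]
mask = conv_attn_sink_mask(tokens)

# The last token sees the two earlier sinks (positions 2 and 6) and itself.
visible_to_last = [j for j, ok in enumerate(mask[-1]) if ok]
print(visible_to_last)

# Compare against dense causal attention, which keeps n * (n + 1) / 2 pairs.
n = len(tokens)
print(attended_pairs(mask), "sparse pairs vs", n * (n + 1) // 2, "dense pairs")
```

In this toy case the saving is small because the utterances are short. The gap widens as utterances get longer, since each earlier utterance contributes a single sink position to the visible context instead of all of its tokens.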