Anchor Attention, Small Cache: Code Generation with Large Language Models

Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen
2024-11-11
Abstract: The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand for computational resources has hindered broader deployment and raised environmental concerns. A common strategy for reducing computational demands is to cache Key-Value (KV) states from the attention mechanism, which is adopted predominantly by mainstream LLMs. Caching avoids repeated attention computations but brings significant memory overhead. Current practices in NLP often use sparse attention, which may unfortunately lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we analyze the distribution of attention weights within code generation models via an empirical study, uncovering a sparsity pattern, i.e., the aggregation of information at specific anchor points. Based on this observation, we propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information, and layer-wise anchor attention that enables cross-layer communication to mitigate the excessive superposition caused by the compression. Extensive experiments across multiple benchmark datasets confirm the effectiveness of AnchorCoder, which consistently achieves a significant (at least 70%) reduction in KV cache requirements while preserving the majority of the model's performance.
Software Engineering
What problem does this paper attempt to address?
This paper addresses the memory cost of the Key-Value (KV) cache in large language models (LLMs) used for code generation: the cache occupies a large amount of memory, which raises deployment costs and environmental concerns. Existing KV compression methods perform well on natural language processing (NLP) tasks, but they have limitations when applied directly to code generation, especially with long code fragments: the model may fail to capture key information, which degrades the quality of the generated code.
Specifically, the paper points out that current KV compression methods tend to make the model focus on local information. This is particularly harmful in code generation because code has complex long-distance dependencies. When generating code, the model must consider not only the current file but also external files such as imported packages, source files in the same directory, configuration files, and even API documentation. These files may in turn have dependencies of their own, so local context alone is not enough.
To address these problems, the paper proposes a new method, AnchorCoder, which reduces KV cache requirements while preserving most of the model's performance by introducing "anchors" that aggregate and compress context information. AnchorCoder exploits the sparsity of the attention-weight distribution in code generation models: it inserts artificially defined anchors in each line of code and trains them as aggregators of context information, so that the model can attend over fewer KV states, significantly reducing memory overhead.
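The per-line anchor idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the token name `<ANCHOR>`, the whitespace tokenizer, and the visibility rule (a token sees the anchors of earlier lines plus the tokens of its own line) are simplifying assumptions made here, and the sketch ignores the training step and the layer-wise anchor attention mentioned in the abstract.

```python
ANCHOR = "<ANCHOR>"  # hypothetical anchor token name, not taken from the paper


def insert_anchors(code: str) -> list[str]:
    """Split code into whitespace tokens and append one anchor per non-empty line."""
    tokens = []
    for line in code.splitlines():
        words = line.split()
        if words:
            tokens.extend(words)
            tokens.append(ANCHOR)
    return tokens


def visible_positions(tokens: list[str], query: int) -> list[int]:
    """Positions the token at index `query` may attend to under the assumed rule:
    anchors of earlier lines, plus every token of the current line up to `query`."""
    # The current line starts one past the most recent anchor before `query`.
    line_start = 0
    for i in range(query - 1, -1, -1):
        if tokens[i] == ANCHOR:
            line_start = i + 1
            break
    earlier_anchors = [i for i in range(line_start) if tokens[i] == ANCHOR]
    return earlier_anchors + list(range(line_start, query + 1))


code = "def add(a, b):\n    return a + b\nprint(add(1, 2))"
tokens = insert_anchors(code)
query = len(tokens) - 2  # last ordinary token, just before the final anchor
kept = visible_positions(tokens, query)
print(len(kept), "visible positions instead of", query + 1)
# prints: 4 visible positions instead of 11
```

In this toy case only the two earlier anchors and the two tokens of the current line need cached KV states, whereas full causal attention would keep all eleven preceding positions. The ratio here is an artifact of the tiny example and says nothing about the savings reported in the paper.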
Experimental results show that AnchorCoder preserves most of the model's performance on multiple benchmark datasets while reducing KV cache requirements by at least 70%, demonstrating its effectiveness across benchmarks.
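To give a sense of scale for a 70% reduction, the following sketch computes the size of a KV cache with the standard formula (two tensors, keys and values, per layer, each of shape heads × sequence length × head dimension). The model dimensions are hypothetical and not taken from the paper, and the sketch assumes the reduction comes from keeping 30% of the token positions.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Size of the KV cache for one sequence: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration (not from the paper): 32 layers, 32 KV heads,
# head dimension 128, 16-bit values, 4096-token context.
config = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2)
seq_len = 4096

full = kv_cache_bytes(seq_len, **config)

# Assumption: a 70% reduction modelled as keeping 30% of the token positions.
kept_tokens = seq_len * 30 // 100
reduced = kv_cache_bytes(kept_tokens, **config)

print(f"full cache:    {full / 1024**3:.2f} GiB")
print(f"reduced cache: {reduced / 1024**3:.2f} GiB")
```

Under these assumptions the full cache is 2 GiB per sequence and the reduced cache is roughly 0.6 GiB, which shows why cache size, rather than compute, often limits how many sequences can be served at once.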