Abstract: The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand for computational resources has hindered broader deployment and raised environmental concerns. A common strategy for reducing computational demands is to cache the Key-Value (KV) states of the attention mechanism, which is used by most mainstream LLMs. Caching reduces the need for repeated attention computations but introduces significant memory overhead. Current practice in NLP often relies on sparse attention, which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we analyze the distribution of attention weights within code generation models via an empirical study, uncovering a sparsity pattern, i.e., the aggregation of information at specific anchor points. Based on this observation, we propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information, and layer-wise anchor attention, which enables cross-layer communication to mitigate the excessive superposition caused by the compression. Extensive experiments across multiple benchmark datasets confirm the effectiveness of AnchorCoder, which consistently achieves a significant (at least 70%) reduction in KV cache requirements while preserving the majority of the model's performance.
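To give a sense of the memory overhead the abstract refers to, the following is a minimal sketch of the standard KV cache size accounting for multi-head attention (two tensors, K and V, per layer). The model configuration (32 layers, 32 KV heads, head dimension 128, 4096 tokens, 16-bit values) is a hypothetical example and is not taken from the paper; the function names are illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache K and V for every layer, head, and token."""
    # The leading 2 accounts for storing both the key and the value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def reduced_kv_cache_bytes(full_bytes, reduction_percent):
    """Cache size left after removing `reduction_percent` percent of it."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return full_bytes * (100 - reduction_percent) // 100


# Hypothetical configuration, chosen only for illustration.
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
reduced = reduced_kv_cache_bytes(full, reduction_percent=70)

print(f"full cache:    {full / 2**30:.2f} GiB")
print(f"after 70% cut: {reduced / 2**30:.2f} GiB")
```

Under these assumed numbers, a single 4096-token sequence needs 2 GiB of cache, so the "at least 70%" reduction reported in the abstract would leave roughly 0.6 GiB per sequence.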

What problem does this paper attempt to address?

The paper addresses the memory cost of the Key-Value (KV) cache in large language models (LLMs) used for code generation. The cache occupies a large amount of memory, which raises deployment costs and environmental concerns.

Existing KV compression methods perform well on natural language processing (NLP) tasks, but they have limitations when applied directly to code generation, especially for long code fragments. They can prevent the model from accurately capturing key information, which degrades the quality of the generated code. Specifically, the paper points out that current KV compression methods tend to make the model focus on local information. This is particularly harmful in code generation because code has complex long-distance dependencies. For example, when generating code, the model needs to consider not only the content of the current file but also external files such as imported packages, source files in the same directory, configuration files, and even API documentation. These files may have dependencies among themselves, so relying solely on local context is not enough.

To address these problems, the paper proposes a new method, AnchorCoder. AnchorCoder reduces KV cache requirements while largely maintaining model performance by introducing "anchors" that aggregate and compress context information. Specifically, AnchorCoder exploits the sparsity of the attention weight distribution in code generation models. By inserting artificially defined anchors in each line of code and training them to act as aggregators of context information, the model can perform effective attention over fewer KV states, which significantly reduces memory overhead.

Experimental results on multiple benchmark datasets show that AnchorCoder reduces KV cache requirements by at least 70% while preserving most of the model's performance, demonstrating its effectiveness.
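The idea of per-line anchors can be illustrated with a toy attention mask. This is a sketch under stated assumptions, not the paper's implementation: it assumes one anchor token appended at the end of each line, whitespace tokenization, and a rule where each token may attend to tokens in its own line plus the anchors of earlier lines. The token name `<ANCHOR>` and all function names are hypothetical. A mask alone does not make anchors useful; the summary states that the anchors are trained to aggregate context.

```python
ANCHOR = "<ANCHOR>"  # hypothetical anchor token name, not taken from the paper


def insert_line_anchors(lines):
    """Split each line on whitespace and append one anchor token per line."""
    tokens, line_ids = [], []
    for line_no, line in enumerate(lines):
        for tok in line.split() + [ANCHOR]:
            tokens.append(tok)
            line_ids.append(line_no)
    return tokens, line_ids


def anchor_attention_mask(tokens, line_ids):
    """mask[q][k] is True when query token q may attend to key token k."""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only current and earlier positions
            same_line = line_ids[k] == line_ids[q]
            is_anchor = tokens[k] == ANCHOR
            # Earlier lines are visible only through their anchor token.
            mask[q][k] = same_line or is_anchor
    return mask


def peak_visible_keys(mask):
    """Largest number of KV states any single query needs to read."""
    return max(sum(row) for row in mask)


lines = ["import os", "def f ( x ) :", "return x + 1"]
tokens, line_ids = insert_line_anchors(lines)
mask = anchor_attention_mask(tokens, line_ids)

print("tokens in sequence:", len(tokens))
print("peak keys per query:", peak_visible_keys(mask))
```

In this toy setting, once a line is finished only its anchor's KV state has to stay in the cache, so the number of retained states grows with the number of lines rather than the number of tokens.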

Anchor Attention, Small Cache: Code Generation with Large Language Models
