Abstract: Large language model (LLM) inference demands a significant amount of computation and memory, especially in the key attention mechanism. While techniques such as quantization and acceleration algorithms like FlashAttention have improved overall inference efficiency, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for the attention operation. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of the KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during the exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves a 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline while outperforming state-of-the-art quantization and compression techniques across various datasets and models.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper targets the excessive computational and memory requirements of large language model (LLM) inference, especially in the attention mechanism. Specifically:

1. **Computational and Memory Efficiency**: LLMs require large amounts of compute and memory during inference. With long contexts, the KV cache grows with sequence length and the cost of attention grows quadratically, so both memory consumption and latency rise sharply.
2. **Limitations of Existing Technologies**:
   - **Quantization Techniques**: Quantization methods such as Atom and QuaRot reduce memory usage and speed up linear operations, but they usually apply only to the linear components (such as the QKV projections and the FFN) and do not enable quantized execution of the attention mechanism itself.
   - **Acceleration Algorithms**: Acceleration algorithms such as FlashAttention improve execution efficiency but rely on high-precision formats (FP16/32), which leads to high latency with long contexts.
   - **KV Cache Compression**: Existing KV cache compression techniques such as KIVI and GEAR reduce memory bandwidth, but still require floating-point dequantization before the attention operation, which adds overhead.
3. **Need for a Comprehensive Solution**: Addressing compute and memory efficiency together, without the limitations above, requires an attention mechanism that can execute in quantized form: one that compresses the KV cache and also performs the matrix multiplications and the softmax efficiently.
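To make the long-context memory pressure described above concrete, the following is a rough back-of-the-envelope sketch. The helper `kv_cache_bytes` and the model configuration are hypothetical and chosen only for illustration; they are not taken from the paper, and the low-bit estimate ignores the scale and zero-point metadata that a real quantized cache would also store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Estimate KV cache size in bytes (hypothetical helper, illustration only)."""
    # Factor 2: one tensor for keys and one for values in every layer.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value // 8


# Assumed configuration (not from the paper): 32 layers, 32 KV heads, head_dim 128.
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128, batch_size=1)

fp16_4k = kv_cache_bytes(seq_len=4096, bits_per_value=16, **cfg)
fp16_32k = kv_cache_bytes(seq_len=32768, bits_per_value=16, **cfg)
int4_32k = kv_cache_bytes(seq_len=32768, bits_per_value=4, **cfg)

print(f"FP16,  4k tokens: {fp16_4k / 2**30:.1f} GiB")
print(f"FP16, 32k tokens: {fp16_32k / 2**30:.1f} GiB")
print(f"INT4, 32k tokens: {int4_32k / 2**30:.1f} GiB (payload only, no metadata)")
```

Under these assumptions the cache grows linearly with context length, so an 8x longer context needs 8x the memory at the same precision, which is why lower-bit storage matters most for long sequences.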
### Proposed Solution

The paper proposes TurboAttention, a comprehensive method that enables quantized execution of attention through two innovations:

- **FlashQ**: A headwise quantization technique that compresses the KV cache and supports low-precision integer activation-activation multiplication.
- **Sparsity-based Softmax Approximation (SAS)**: A sparsity-based softmax approximation that eliminates the need for dequantization to FP32 during exponentiation.

With these innovations, TurboAttention reduces the KV cache size by more than 4.4x, speeds up attention by 1.2-1.8x, and reaches up to 2.37x maximum throughput over the FP16 baseline, while maintaining near-lossless accuracy across multiple datasets and models.

### Formula Representation

1. **Multi-Head Attention Mechanism**:

\[ \text{MHA}(X)=\text{Concat}(H^{(1)}, \ldots, H^{(H)}) W_{o} \]

where, with \(d_H\) the per-head dimension,

\[ H^{(h)}=\text{Softmax}\left(\frac{Q^{(h)} {K^{(h)}}^T}{\sqrt{d_H}}\right) V^{(h)} \]

\[ Q^{(h)} = X W_{q}^{(h)}, \quad K^{(h)}=X W_{k}^{(h)}, \quad V^{(h)}=X W_{v}^{(h)} \]

2. **Block-Progressive Quantization of FlashQ**:

\[ X_{q1}=\text{Quant8}_{sym}(X) \]

\[ K_{q2}^g=\text{Quant4/2}_{asym}(K_{q1}^g), \quad V_{q2}^g=\text{Quant4/2}_{asym}(V_{q1}^g) \]

3. **Exponential Approximation of SAS**:

\[ e^{-x}=e^{-x_{int}} \times e^{-x_{dec}} \approx LUT(-x_{int}) \times POLY(-x_{dec}) \]

\[ SAS(x)= \begin{cases} 0 & \text{if } x < n_r \\ LUT(x_{int}) \times POLY(x_{dec}) & \text{otherwise} \end{cases} \]
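The two-stage quantization and the SAS exponential above can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the function names are hypothetical, quantization is applied to a flat list instead of per head and block, the second stage uses an integer scale of my own choosing, the threshold `N_R = -6` is an assumed value for \(n_r\), and a degree-5 Taylor polynomial stands in for whatever fitted polynomial POLY the authors use.

```python
import math

N_R = -6  # assumed sparsity threshold n_r: inputs below it are treated as e^x = 0
LUT = [math.exp(-k) for k in range(-N_R + 1)]  # LUT[k] = e^{-k} for k = 0..6


def poly_exp(t):
    """Degree-5 Taylor approximation of e^t for t in (-1, 0] (stand-in for POLY)."""
    return 1 + t * (1 + t * (1 / 2 + t * (1 / 6 + t * (1 / 24 + t / 120))))


def sas_exp(x):
    """Approximate e^x for x <= 0 as LUT(x_int) * POLY(x_dec), or 0 below N_R."""
    if x > 0:
        raise ValueError("expects x <= 0 (scores after subtracting the row maximum)")
    if x < N_R:
        return 0.0
    x_int = math.ceil(x)  # integer part, in {N_R, ..., 0}
    x_dec = x - x_int     # fractional part, in (-1, 0]
    return LUT[-x_int] * poly_exp(x_dec)


def sas_softmax(scores):
    """Softmax that uses sas_exp in place of the exact exponential."""
    m = max(scores)
    exps = [sas_exp(s - m) for s in scores]
    total = sum(exps)  # at least 1, because the maximum entry contributes e^0 = 1
    return [e / total for e in exps]


def quant8_sym(xs):
    """Stage 1: symmetric INT8 quantization, q = round(x / scale)."""
    scale = max(abs(v) for v in xs) / 127 or 1.0  # fall back to 1.0 for all-zero input
    return [round(v / scale) for v in xs], scale


def quant_asym(qs, bits=4):
    """Stage 2: asymmetric quantization of INT8 codes down to `bits` bits."""
    lo, hi = min(qs), max(qs)
    levels = 2**bits - 1
    scale = max(1, -(-(hi - lo) // levels))  # integer ceiling division (assumed design)
    return [round((q - lo) / scale) for q in qs], scale, lo


def dequant_to_int8(codes, scale, zero):
    """Integer-only reconstruction of INT8 codes, clamped to the symmetric INT8 range."""
    return [max(-127, min(127, c * scale + zero)) for c in codes]


x = [0.5, -1.0, 0.25, 0.75]
q8, s8 = quant8_sym(x)
q4, s4, z4 = quant_asym(q8, bits=4)
q8_back = dequant_to_int8(q4, s4, z4)
print(q8, q4, q8_back)
print(sas_softmax([2.0, 1.0, -8.0]))
```

In this sketch the low-bit codes are mapped back to INT8 with integer arithmetic only, which mirrors the idea of avoiding floating-point dequantization before the attention matrix multiplications. Scores more than 6 below the row maximum contribute exactly zero to the softmax, which is the sparsity that SAS exploits.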

TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

INT-FlashAttention: Enabling Flash Attention for INT8 Quantization

AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads

MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Beyond KV Caching: Shared Attention for Efficient LLMs

Squeezed Attention: Accelerating Long Context Length LLM Inference

Fast Quantum Algorithm for Attention Computation

KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity