Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan
Abstract: Large language model (LLM) inference demands a significant amount of computation and memory, especially in the key attention mechanism. While techniques such as quantization and acceleration algorithms like FlashAttention have improved overall inference efficiency, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for the attention operation.
We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of the KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during the exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves a 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline while outperforming state-of-the-art quantization and compression techniques across various datasets and models.
### What problem does this paper attempt to solve?
This paper targets the heavy computational and memory cost of large language model (LLM) inference, especially in the attention mechanism. Specifically:
1. **Computational and Memory Efficiency**: LLM inference requires substantial compute and memory. With long contexts in particular, the KV cache footprint grows significantly and the cost of attention grows quadratically with sequence length.
2. **Limitations of Existing Technologies**:
- **Quantization Techniques**: Quantization methods such as Atom and QuaRot reduce memory usage and speed up linear operations, but they generally apply only to the linear components (such as the QKV projections and the FFN) and do not address quantized execution inside the attention mechanism.
- **Acceleration Algorithms**: Algorithms such as FlashAttention improve execution efficiency but rely on high-precision formats (FP16/FP32), which leads to high latency for long contexts.
- **KV Cache Compression**: Existing KV cache compression techniques such as KIVI and GEAR reduce memory bandwidth but still require dequantization to floating point before the attention operations, which adds overhead.
3. **Need for a Comprehensive Solution**: Addressing compute and memory efficiency together, without these limitations, requires an attention mechanism that executes in quantized form: one that compresses the KV cache and also performs the matrix multiplications and the softmax efficiently.
### Proposed Solution
The paper proposes TurboAttention, a comprehensive method that makes attention executable in quantized form through two innovations:
- **FlashQ**: A headwise quantization technique that compresses the KV cache and also supports low-precision integer activation-activation multiplication.
- **Sparsity-based Softmax Approximation (SAS)**: A sparsity-based softmax approximation that eliminates the need for dequantization to FP32 during exponentiation.
With these innovations, TurboAttention reduces the KV cache size by more than 4.4x, speeds up attention by 1.2-1.8x, and reaches up to 2.37x the maximum throughput of the FP16 baseline across multiple datasets and models, while keeping accuracy nearly lossless.
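To give a sense of scale for the KV cache figure, the following is a back-of-the-envelope sketch in Python. The model configuration (32 layers, 32 KV heads, head dimension 128, a 32,768-token context) is a hypothetical example and is not taken from the paper; only the 4.4x reduction factor comes from the summary above, and it is used here as a lower bound.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Size of the KV cache in bytes.

    The leading factor 2 accounts for storing one K tensor and one V tensor
    per layer.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model configuration (an assumption, not from the paper).
fp16_bytes = kv_cache_bytes(
    layers=32, kv_heads=32, head_dim=128, seq_len=32768, batch=1, bytes_per_value=2
)

GIB = 1024 ** 3
compressed_bytes = fp16_bytes / 4.4  # "over 4.4x" reduction reported above

print(f"FP16 KV cache:      {fp16_bytes / GIB:.2f} GiB")
print(f"After 4.4x savings: {compressed_bytes / GIB:.2f} GiB")
```

Under these assumed dimensions the FP16 cache for a single sequence is 16 GiB, so a reduction of more than 4.4x is the difference between fitting one sequence and fitting several in the same memory budget.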
### Formula Representation
1. **Multi-Head Attention Mechanism**:
\[
\text{MHA}(X)=\text{Concat}(H^{(1)}, \ldots, H^{(H)}) W_{o}
\]
where
\[
H^{(h)}=\text{Softmax}\left(\frac{Q^{(h)} {K^{(h)}}^T}{\sqrt{d_H}}\right) V^{(h)}
\]
\[
Q^{(h)} = X W_{q}^{(h)}, \quad K^{(h)}=X W_{k}^{(h)}, \quad V^{(h)}=X W_{v}^{(h)}
\]
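The multi-head attention equations above can be sketched in plain Python as follows. This is a minimal full-precision reference, not TurboAttention itself: it uses no quantization, no tiling, and no causal mask, and the helper names (`matmul`, `softmax_rows`, `attention_head`, `multi_head_attention`) are hypothetical names chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(x, w_q, w_k, w_v):
    """One head: Softmax(Q K^T / sqrt(d_H)) V with Q = X W_q, K = X W_k, V = X W_v."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_h = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_h) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head_attention(x, heads, w_o):
    """MHA(X) = Concat(H^(1), ..., H^(H)) W_o.

    `heads` is a list of (w_q, w_k, w_v) triples, one triple per head.
    """
    outputs = [attention_head(x, *weights) for weights in heads]
    # Concatenate the per-head outputs along the feature dimension, token by token.
    concat = [sum((h[i] for h in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)
```

The two `matmul` calls inside `attention_head` (scores and the product with V) are the activation-activation multiplications that the summary says FlashQ executes in low-precision integers, and `softmax_rows` contains the exponentiation that SAS approximates.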
2. **Block-Progressive Quantization of FlashQ**:
\[
X_{q1}=\text{Quant8}_{sym}(X)
\]
\[
K_{q2}^g=\text{Quant4/2}_{asym}(K_{q1}^g), \quad V_{q2}^g=\text{Quant4/2}_{asym}(V_{q1}^g)
\]
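The two quantization stages above can be illustrated with a small Python sketch on a single group of values. It follows only the structure shown in the equations (symmetric 8-bit first, then asymmetric 4- or 2-bit on the 8-bit codes). The rounding mode, the clamping range, the way groups are formed, and keeping the second-stage scale as a float are all assumptions of this sketch, and the function names are hypothetical.

```python
def quant8_sym(values):
    """Stage 1: symmetric INT8. Scale by max|x| / 127, no zero point."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def quant_asym(codes, bits):
    """Stage 2: asymmetric quantization of INT8 codes down to `bits` bits (4 or 2).

    Returns the low-bit codes plus the scale and zero point needed to map them
    back to the INT8 domain.
    """
    lo, hi = min(codes), max(codes)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [max(0, min(levels, round((c - lo) / scale))) for c in codes]
    return q, scale, lo


def dequant(q, scale2, zero, scale1):
    """Undo stage 2 (back to the INT8 domain), then stage 1 (back to real values)."""
    return [(v * scale2 + zero) * scale1 for v in q]


# Example on one hypothetical group of key values.
group = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes8, s1 = quant8_sym(group)
codes4, s2, zero = quant_asym(codes8, bits=4)
restored = dequant(codes4, s2, zero, s1)
```

Because the second stage operates on the integer codes of the first, undoing it lands in the INT8 domain rather than in floating point; the final multiplication by `scale1` in `dequant` is included here only to check the reconstruction error.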
3. **Exponential Approximation of SAS**:
\[
e^{-x}=e^{-x_{int}} \times e^{-x_{dec}} \approx LUT(-x_{int}) \times POLY(-x_{dec})
\]
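The factorization above can be sketched in Python as a lookup table for the integer part multiplied by a polynomial for the fractional part. Two things here are assumptions rather than the paper's choices: the polynomial is a cubic Taylor expansion around 0.5, swapped in for whatever fitted coefficients the authors use, and `CUTOFF` is a hypothetical threshold past which the result is treated as zero so that the table stays small. In a softmax the exponent is the score minus the row maximum, so the input `x` below (the distance below the maximum) is always non-negative.

```python
import math

CUTOFF = 6  # hypothetical threshold: e^{-x} is treated as 0 for x > CUTOFF
LUT = [math.exp(-i) for i in range(CUTOFF + 1)]  # e^{-x_int} for x_int = 0..CUTOFF


def poly_exp_neg(t):
    """Cubic Taylor polynomial of e^{-t} around t = 0.5, intended for 0 <= t < 1."""
    u = t - 0.5
    return math.exp(-0.5) * (1.0 - u + u * u / 2.0 - u ** 3 / 6.0)


def approx_exp_neg(x):
    """Approximate e^{-x} for x >= 0 as LUT(x_int) * POLY(x_dec)."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x > CUTOFF:
        return 0.0  # small contributions are dropped entirely
    x_int = int(x)        # integer part, since x is non-negative
    x_dec = x - x_int     # fractional part in [0, 1)
    return LUT[x_int] * poly_exp_neg(x_dec)
```

With this particular polynomial the absolute error stays below about 0.003 over the covered range; a fitted polynomial of the same degree would be tighter.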
\[
SAS(x)=
\begin{cases}
0 & \text{if } x < n_r \\
LUT(x_{int})
\end{cases}