SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling

Huizheng Wang, Jiahao Fang, Xinru Tang, Zhiheng Yue, Jinxi Li, Yubin Qin, Sihan Guan, Qize Yang, Yang Wang, Chao Li, Yang Hu, Shouyi Yin
2024-07-15
Abstract: Benefiting from the self-attention mechanism, Transformer models have attained impressive contextual comprehension capabilities for lengthy texts. The requirements of high-throughput inference arise as the large language models (LLMs) become increasingly prevalent, which calls for large-scale token parallel processing (LTPP). However, existing dynamic sparse accelerators struggle to effectively handle LTPP, as they solely focus on separate stage optimization, and with most efforts confined to computational enhancements. By re-examining the end-to-end flow of dynamic sparse acceleration, we pinpoint an ever-overlooked opportunity that the LTPP can exploit the intrinsic coordination among stages to avoid excessive memory access and redundant computation. Motivated by our observation, we present SOFA, a cross-stage compute-memory efficient algorithm-hardware co-design, which is tailored to tackle the challenges posed by LTPP of Transformer inference effectively. We first propose a novel leading zero computing paradigm, which predicts attention sparsity by using log-based add-only operations to avoid the significant overhead of prediction. Then, a distributed sorting and a sorted updating FlashAttention mechanism are proposed with a cross-stage coordinated tiling principle, which enables fine-grained and lightweight coordination among stages, helping optimize memory access and latency. Further, we propose a SOFA accelerator to support these optimizations efficiently. Extensive experiments on 20 benchmarks show that SOFA achieves $9.5\times$ speed up and $71.5\times$ higher energy efficiency than Nvidia A100 GPU. Compared to 8 SOTA accelerators, SOFA achieves an average $15.8\times$ energy efficiency, $10.3\times$ area efficiency and $9.3\times$ speed up, respectively.
Hardware Architecture
What problem does this paper attempt to address?
The paper aims to address the high latency and low energy efficiency issues faced by large language models (LLMs) when processing long sequences, particularly focusing on the computational and memory bottlenecks in the self-attention mechanism. Specifically, existing dynamic sparse accelerators face the following three major challenges when handling large-scale token parallel processing (LTPP):

1. **High computational complexity in the prediction phase**: When processing a large number of tokens, the computational load required for the pre-computation phase and Top-k sorting phase increases significantly, leading to excessive prediction overhead that can outweigh the benefits of sparse acceleration.
2. **Low memory access efficiency**: Since existing methods need to process data row by row, intermediate results must be stored in DRAM, resulting in a large amount of memory access overhead, which in turn affects inference efficiency.
3. **High computational cost of FlashAttention**: Although FlashAttention reduces memory access through a chunking strategy, this method increases computational costs, making it less suitable in LTPP scenarios.

To address the above issues, the authors propose SOFA, a cross-stage coordinated computation-memory efficient algorithm and hardware co-design. SOFA addresses the above challenges through the following three key techniques:

- **Differential Leading Zero Summation (DLZS)**: Reduces the overhead in the pre-computation phase by using a multiplication-free logarithmic domain computation paradigm to predict attention sparsity.
- **Spherical Search Assisted Distributed Sorting (SADS)**: Segments long sequences into sub-segments for independent chunk sorting to reduce the total number of comparisons.
- **Sorted Update FlashAttention (SU-FA)**: Optimizes attention computation by leveraging cross-stage sorting information, thereby reducing computational load.

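
The two prediction-stage ideas above (a multiplication-free, log-domain score estimate and segment-wise sorting) can be illustrated with a minimal Python sketch. It is not the paper's DLZS or SADS: it assumes integer-quantized inputs, approximates each magnitude by the position of its leading one bit (equivalently, its leading-zero count), and uses a plain per-segment top-k followed by a merge. The names `lz_exponent`, `approx_dot`, and `segmented_topk` are hypothetical.

```python
import heapq


def lz_exponent(x: int) -> int:
    """floor(log2(|x|)), read off the leading-one position (x must be nonzero)."""
    return abs(x).bit_length() - 1


def approx_dot(q, k):
    """Estimate sum(q[i] * k[i]) using only additions and shifts.

    Assumption: each magnitude is rounded down to a power of two, so a
    product becomes an addition of exponents followed by a shift.
    """
    total = 0
    for a, b in zip(q, k):
        if a == 0 or b == 0:
            continue  # a zero operand contributes nothing
        exponent = lz_exponent(a) + lz_exponent(b)  # add-only in the log domain
        magnitude = 1 << exponent                   # shift instead of multiply
        same_sign = (a < 0) == (b < 0)
        total += magnitude if same_sign else -magnitude
    return total


def segmented_topk(scores, k, seg_len):
    """Return indices of the k largest scores, best first.

    Each segment is ranked independently (these steps could run in
    parallel); only the per-segment winners are merged at the end.
    """
    candidates = []
    for start in range(0, len(scores), seg_len):
        segment = range(start, min(start + seg_len, len(scores)))
        candidates.extend(heapq.nlargest(k, segment, key=scores.__getitem__))
    return heapq.nlargest(k, candidates, key=scores.__getitem__)
```

The estimate is exact when every operand is a power of two and coarse otherwise, which is acceptable here only because the prediction stage needs a ranking of scores rather than their exact values.
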
These techniques operate together within a fine-grained, cross-stage pipelined dataflow, enabling SOFA to significantly improve speed and energy efficiency when processing long sequences. Experimental results show that, across 20 benchmarks, SOFA achieves on average a $9.5\times$ speedup and $71.5\times$ higher energy efficiency than the Nvidia A100 GPU.
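
One way to see why sorting information from an earlier stage can cut work in a FlashAttention-style computation is the sketch below. It is an illustration under stated assumptions, not the paper's SU-FA: it handles a single query in pure Python, sorts on exact scores (in SOFA the ordering would come from the prediction stage), and uses the hypothetical names `tiled_attention` and `sorted_tiled_attention`. Tile-wise online softmax must rescale its accumulators whenever a later tile raises the running maximum; if tiles arrive in descending score order, the maximum is fixed by the first tile and no rescaling occurs.

```python
import math


def tiled_attention(scores, values, tile=2):
    """Tile-wise online softmax for one query.

    Returns the attention output and the number of accumulator rescales.
    """
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    rescales = 0
    for start in range(0, len(scores), tile):
        tile_scores = scores[start:start + tile]
        tile_values = values[start:start + tile]
        new_max = max(running_max, max(tile_scores))
        if new_max > running_max and denom > 0.0:
            # The maximum grew: rescale everything accumulated so far.
            scale = math.exp(running_max - new_max)
            denom *= scale
            acc = [a * scale for a in acc]
            rescales += 1
        running_max = new_max
        for s, v in zip(tile_scores, tile_values):
            w = math.exp(s - running_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
    return [a / denom for a in acc], rescales


def sorted_tiled_attention(scores, values, tile=2):
    """Same computation, but keys are visited in descending score order."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return tiled_attention([scores[i] for i in order],
                           [values[i] for i in order], tile)


scores = [0.5, 2.0, -1.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
out_plain, rescales_plain = tiled_attention(scores, values)
out_sorted, rescales_sorted = sorted_tiled_attention(scores, values)
```

Both calls produce the same output up to floating-point rounding; only the amount of rescaling work differs between them.
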