Abstract: Benefiting from the self-attention mechanism, Transformer models have attained impressive contextual comprehension of lengthy texts. As large language models (LLMs) become increasingly prevalent, demand for high-throughput inference grows, which calls for large-scale token parallel processing (LTPP). However, existing dynamic sparse accelerators struggle to handle LTPP effectively, because they optimize each stage in isolation and confine most of their effort to computational enhancements. By re-examining the end-to-end flow of dynamic sparse acceleration, we pinpoint a long-overlooked opportunity: LTPP can exploit the intrinsic coordination among stages to avoid excessive memory access and redundant computation. Motivated by this observation, we present SOFA, a cross-stage compute-memory efficient algorithm-hardware co-design tailored to the challenges that LTPP poses for Transformer inference. We first propose a novel leading zero computing paradigm, which predicts attention sparsity using log-based add-only operations and thereby avoids the significant overhead of prediction. Then, we propose a distributed sorting mechanism and a sorted updating FlashAttention mechanism, built on a cross-stage coordinated tiling principle that enables fine-grained, lightweight coordination among stages and helps optimize memory access and latency. Further, we propose the SOFA accelerator to support these optimizations efficiently. Extensive experiments on 20 benchmarks show that SOFA achieves a $9.5\times$ speedup and $71.5\times$ higher energy efficiency than the Nvidia A100 GPU. Compared to 8 SOTA accelerators, SOFA achieves on average $15.8\times$ higher energy efficiency, $10.3\times$ higher area efficiency and a $9.3\times$ speedup.

What problem does this paper attempt to address?

The paper aims to address the high latency and low energy efficiency that large language models (LLMs) face when processing long sequences, focusing on the computational and memory bottlenecks in the self-attention mechanism. Specifically, existing dynamic sparse accelerators face the following three major challenges when handling large-scale token parallel processing (LTPP):

1. **High computational complexity in the prediction phase**: When processing a large number of tokens, the computational load of the pre-computation phase and the Top-k sorting phase increases significantly, leading to prediction overhead that can outweigh the benefits of sparse acceleration.
2. **Low memory access efficiency**: Since existing methods need to process data row by row, intermediate results must be stored in DRAM, resulting in a large amount of memory access overhead, which in turn affects inference efficiency.
3. **High computational cost of FlashAttention**: Although FlashAttention reduces memory access through a chunking strategy, this method increases computational costs, making it less suitable for LTPP scenarios.

To address these issues, the authors propose SOFA, a cross-stage coordinated, computation-memory efficient algorithm and hardware co-design. SOFA addresses the challenges through the following three key techniques:

- **Differential Leading Zero Summation (DLZS)**: Reduces the overhead of the pre-computation phase by using a multiplication-free logarithmic-domain computation paradigm to predict attention sparsity.
- **Spherical Search Assisted Distributed Sorting (SADS)**: Segments long sequences into sub-segments for independent chunk sorting to reduce the total number of comparisons.
- **Sorted Update FlashAttention (SU-FA)**: Optimizes attention computation by leveraging cross-stage sorting information, thereby reducing computational load.

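
The prediction-side ideas above (log-domain, multiplication-free score estimation followed by segment-wise selection) can be illustrated with a minimal Python sketch. It is not the authors' DLZS or SADS: the differential step and the spherical search are omitted, and the function names `log2_floor`, `approx_dot` and `segment_topk`, the integer-quantized inputs, the equal-size segments and the fixed per-segment `k` are all assumptions made for illustration.

```python
import heapq


def log2_floor(x: int) -> int:
    """Index of the highest set bit of |x|, as a leading-zero counter would give."""
    return abs(x).bit_length() - 1


def approx_dot(q, k):
    """Multiplication-free estimate of the dot product of two integer vectors.

    Each product q_i * k_i is replaced by sign * 2**(e_q + e_k), where e_q and
    e_k are the highest-set-bit positions, so only adds and shifts are used.
    """
    total = 0
    for a, b in zip(q, k):
        if a == 0 or b == 0:
            continue  # a zero operand contributes nothing
        magnitude = 1 << (log2_floor(a) + log2_floor(b))
        total += magnitude if (a < 0) == (b < 0) else -magnitude
    return total


def segment_topk(scores, num_segments, k_per_segment):
    """Select the k_per_segment highest-scoring indices inside each segment.

    Segments are handled independently (assumption: equal-size contiguous
    segments), so no comparison ever crosses a segment boundary.
    """
    n = len(scores)
    seg_len = (n + num_segments - 1) // num_segments
    selected = []
    for start in range(0, n, seg_len):
        segment = range(start, min(start + seg_len, n))
        selected.extend(
            heapq.nlargest(k_per_segment, segment, key=scores.__getitem__)
        )
    return sorted(selected)
```

For example, `approx_dot([3, -5, 8, 1], [4, -6, 7, 0])` returns 56 where the exact dot product is 98. The estimate is coarse and is only meant for ranking candidate keys, not for computing the final attention values.
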
These techniques operate together on a cross-stage, fine-grained pipelined data flow, enabling SOFA to significantly improve speed and energy efficiency when processing long sequences. Experimental results show that, across 20 benchmarks, SOFA achieves an average 9.5 times speedup and 71.5 times higher energy efficiency than the Nvidia A100 GPU.
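
As a rough illustration of how sorting information from an earlier stage could reduce work in a FlashAttention-style update, the sketch below compares a standard online softmax with a variant that visits keys in descending score order. This is an assumption about the general idea, not the authors' SU-FA: the function names and the per-row, untiled formulation are hypothetical. When the first visited score is the row maximum, the running maximum never changes, so the accumulators never need rescaling.

```python
import math


def attention_row_online(scores, values):
    """Baseline online softmax for one query row.

    The accumulators are rescaled whenever the running maximum grows.
    """
    m = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for s, vec in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # exp(-inf) is 0.0 on the first step
        w = math.exp(s - m_new)
        denom = denom * scale + w
        acc = [a * scale + w * v for a, v in zip(acc, vec)]
        m = m_new
    return [a / denom for a in acc]


def attention_row_sorted(scores, values, order):
    """Same result, but keys are visited in descending score order.

    `order` is assumed to come from an earlier sorting stage. The first
    visited score is the row maximum, so no rescaling step is required.
    """
    m = scores[order[0]]  # row maximum, fixed for the whole pass
    denom = 0.0
    acc = [0.0] * len(values[0])
    for j in order:
        w = math.exp(scores[j] - m)
        denom += w
        for d, v in enumerate(values[j]):
            acc[d] += w * v
    return [a / denom for a in acc]
```

Both functions compute the same softmax-weighted sum. The sorted variant simply drops the `scale` multiplication on every step, which is the kind of saving that reusing sorting results across stages can make available.
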

SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling
