Abstract: Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Even if Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM, 8 general-purpose RISC-V cores, a 24x8 systolic array MatMul accelerator, and a novel accelerator for Transformer softmax and GELU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (121x speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). In 12nm technology, SoftEx occupies 0.039 mm$^2$, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to 10.8x and 5.11x, respectively, while reducing their energy consumption by up to 10.8x and 5.29x. These enhancements translate into a 1.58x increase in throughput (310 GOPS at 0.8V) and a 1.42x improvement in energy efficiency (1.34 TOPS/W at 0.55V) on end-to-end ViT inference workloads.

What problem does this paper attempt to address?

The paper addresses the efficient deployment of Transformer models on edge devices. Specifically, it focuses on accelerating the non-linear functions (such as softmax and GELU) that become performance bottlenecks once matrix multiplication is offloaded to dedicated hardware. The core problems of the paper and their solutions are as follows:

### 1. **Problem Description**

Transformer models have achieved remarkable results in fields such as natural language processing, computer vision, and audio processing, but their complexity and computational requirements also pose challenges. Although Transformers are computationally dominated by matrix multiplication (MatMul), non-linear functions such as softmax and GELU can become the performance bottleneck when MatMul is hardware-accelerated. Computing these non-linearities on edge devices remains time-consuming and energy-intensive.

### 2. **Objective**

The objective of the paper is to design a flexible hardware-acceleration template that significantly speeds up the non-linear functions in Transformer models while maintaining high accuracy, thereby enabling efficient inference at the edge. Specifically, the authors aim to:

- Provide a fast, hardware-friendly approximation of the exponential function.
- Design a hardware accelerator (SoftEx) dedicated to softmax and GELU.
- Integrate SoftEx into a multi-core heterogeneous cluster to improve overall performance and energy efficiency.

### 3. **Solution**

To solve the above problems, the paper proposes the following innovations:

- **Exponential Function Approximation (expp)**: Building on Schraudolph's method, the paper adds a polynomial correction that improves the accuracy of exponentiation at BF16 precision.
- **SoftEx Accelerator**: A hardware accelerator designed specifically for softmax and GELU. Using the improved exponential approximation, it achieves speedups of up to 10.8x and energy reductions of up to 10.8x over optimized software.
- **Multi-core Heterogeneous Cluster**: SoftEx is integrated into a heterogeneous cluster containing 8 RISC-V cores and a 24×8 systolic-array MatMul accelerator, enabling the whole system to run Transformer inference more efficiently.

### 4. **Experimental Results**

Implemented in a 12 nm process, the SoftEx accelerator accounts for only 3.22% of the cluster area, yet it delivers a 1.58x increase in throughput and a 1.42x improvement in energy efficiency on end-to-end ViT inference. Compared to an optimized software implementation, SoftEx accelerates softmax and GELU by up to 10.8x and 5.11x, respectively, while reducing their energy consumption by up to 10.8x and 5.29x.

### Summary

The core of the paper is removing the performance bottleneck that arises when Transformer models are deployed on edge devices, by accelerating non-linear functions (such as softmax and GELU) in hardware so that edge devices can run generative AI workloads more efficiently.
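To make the idea behind the exponential approximation more concrete, the following is a minimal Python sketch of the basic Schraudolph trick that the summary says expp builds on: a scaled copy of the input is written directly into the bit pattern of a floating-point number, so the exponent field does the work of computing a power of two. This is not the paper's expp algorithm. It omits the polynomial correction (whose details are not given here), it uses float32 rather than BF16 for simplicity, and the names `schraudolph_exp` and `approx_softmax` are hypothetical. Without any correction, this plain variant overestimates the true value by up to roughly 6%.

```python
import math
import struct

# float32 layout: 1 sign bit, 8 exponent bits (bias 127), 23 mantissa bits.
_SCALE = (1 << 23) / math.log(2.0)  # one unit of x/ln(2) equals one exponent step
_BIAS = 127 << 23                   # bit pattern of 1.0 in float32
_MAX_FINITE_BITS = 0x7F7FFFFF       # largest finite positive float32


def schraudolph_exp(x: float) -> float:
    """Approximate e**x by building the float32 bit pattern directly.

    Plain Schraudolph-style approximation, with no polynomial correction.
    """
    bits = math.floor(_SCALE * x) + _BIAS
    # Clamp so that very negative inputs flush to 0.0 and very large
    # inputs saturate at the largest finite float32.
    bits = max(0, min(bits, _MAX_FINITE_BITS))
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def approx_softmax(values):
    """Softmax that uses the approximate exponential above."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [schraudolph_exp(v - peak) for v in values]
    total = sum(exps)   # at least 1.0, since the peak element maps to exp(0)
    return [e / total for e in exps]


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, schraudolph_exp(x), math.exp(x))
    print(approx_softmax([1.0, 2.0, 3.0]))
```

The approximation is linear in the mantissa between consecutive powers of two, which is where the error comes from. A correction term applied to the mantissa, as the summary describes for expp, is one way to reduce that error while keeping the cheap bit-level construction.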

A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU

Reusing Softmax Hardware Unit for GELU Computation in Transformers

H3D-Transformer: A Heterogeneous 3D (H3D) Computing Platform for Transformer Model Acceleration on Edge Devices

Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

Ayaka: A Versatile Transformer Accelerator with Low-Rank Estimation and Heterogeneous Dataflow

Exploring Approximation and Dataflow Co-Optimization for Scalable Transformer Inference Architecture on the Edge

SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers

OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture

Energon: Toward Efficient Acceleration of Transformers Using Dynamic Sparse Attention

ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs

Research on LLM Acceleration Using the High-Performance RISC-V Processor "Xiangshan" (Nanhu Version) Based on the Open-Source Matrix Instruction Set Extension (Vector Dot Product)

HUGE2: a Highly Untangled Generative-model Engine for Edge-computing

A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models

PF‐GEMV: Utilization maximizing architecture in fast matrix–vector multiplication for GPT‐2 inference

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms

X-Former: In-Memory Acceleration of Transformers

Co-Designing Binarized Transformer and Hardware Accelerator for Efficient End-to-End Edge Deployment