Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads

Mukul Lokhande, Gopal Raut, Santosh Kumar Vishvakarma
2024-12-16
Abstract: The rapid adoption of data-driven AI models, such as deep learning inference, training, Vision Transformers (ViTs), and other HPC applications, drives a strong need for hardware support for diverse non-linear activation functions (AFs) with runtime-configurable precision. Existing solutions support either diverse precision or runtime AF reconfigurability but fail to address both simultaneously. This work proposes a flexible and SIMD multi-precision processing element (Flex-PE), which supports the MAC operation and diverse runtime-configurable AFs, including sigmoid, tanh, ReLU and softmax. The proposed design achieves an improved throughput of up to 16× for FxP4, 8× for FxP8, 4× for FxP16 and 1× for FxP32 in pipeline mode with 100% time-multiplexed hardware. This work also proposes an area-efficient multi-precision iterative mode in the SIMD systolic arrays for edge AI use cases. The design delivers superior performance with up to 62× and 371× reductions in DMA reads for input feature maps and weight filters, respectively, in VGG16, with an energy efficiency of 8.42 GOPS/W within an accuracy loss of 2%. The proposed architecture supports emerging 4-bit computations for DL inference while enhancing throughput in FxP8/16 modes for transformers and other HPC applications. The proposed approach enables future energy-efficient AI accelerators in edge and cloud environments.
Hardware Architecture; Computer Vision and Pattern Recognition; Distributed, Parallel, and Cluster Computing; Image and Video Processing
What problem does this paper attempt to address?
The main problem this paper addresses is that existing hardware accelerators do not support multiple precisions and run-time configurable non-linear activation functions (AFs) at the same time. Specifically:

1. **Simultaneously support diverse precisions and run-time configurable activation functions**: Existing solutions support either diverse precisions or run-time configurable activation functions, but cannot meet both requirements simultaneously. This limits the flexibility and efficiency of hardware accelerators in edge AI and high-performance computing (HPC) workloads.
2. **Improve throughput and resource utilization**: To meet the computational demands of deep-learning inference, training, Vision Transformers (ViTs) and other HPC applications, a hardware design is needed that operates efficiently in different precision modes and supports multiple activation functions. Existing hardware designs have bottlenecks in this regard, especially when handling large-scale parallel tasks.
3. **Reduce the pressure on memory bandwidth and interconnect bandwidth**: As AI models grow, memory access and data transfer become the main bottlenecks of system performance. A hardware architecture is therefore needed that reduces the number of DMA reads, optimizes memory usage and improves energy efficiency.

### Solutions proposed in the paper

The paper proposes a flexible and SIMD multi-precision processing element (Flex-PE), with the following main features:

- **Support for diverse activation functions**: Including common activation functions such as Sigmoid, Tanh, ReLU and Softmax.
- **Multi-precision support**: It can switch at run time between 4-bit, 8-bit, 16-bit and 32-bit fixed-point precisions to adapt to different application scenarios.
- **High throughput**: Through time-multiplexing and pipelining, throughput improvements of 16× for FxP4, 8× for FxP8, 4× for FxP16 and 1× for FxP32 are achieved.
- **Efficient SIMD architecture**: Using the CORDIC algorithm and a 5-stage SIMD logarithmic barrel shifter (LBS), flexible multi-precision operations and activation function calculations are achieved.

### Main contributions

1. **Flexible multi-precision configurable activation function (AF)**:
   - Supports multiple activation functions (Tanh, ReLU, Softmax, Sigmoid) and multi-precision (FxP4/8/16/32) operations.
   - Achieves almost 100% hardware utilization through time-multiplexing and improves throughput.
2. **Flexible and SIMD multi-precision CORDIC-based processing element (Flex-PE)**:
   - Uses the CORDIC method for MAC and multiple activation function operations, providing a configurable and efficient AI core.
   - Has an iterative mode (suited to resource-constrained environments) and a pipeline mode (suited to HPC applications), trading off latency against performance.
3. **Performance-enhanced SIMD systolic array analysis**:
   - Evaluates run-time adaptation to different precision levels in edge applications, achieving an energy efficiency of 8.42 GOPS/W with an accuracy loss within 2%.
   - For the VGG-16 model, reduces the number of DMA reads of input feature maps and weight filters by 62 times and 371 times, respectively.

These improvements not only enhance the adaptability and resource efficiency of the hardware, but also make it better suited to the complex requirements of AI applications.