Abstract: The rapid adoption of data-driven AI models, in workloads such as deep learning inference and training, Vision Transformers (ViTs), and other HPC applications, drives a strong need for hardware support for runtime-configurable precision and diverse non-linear activation functions (AFs). Existing solutions support either diverse precision or runtime AF reconfigurability but fail to address both simultaneously. This work proposes a flexible SIMD multiprecision processing element (Flex-PE), which supports diverse runtime-configurable AFs, including sigmoid, tanh, ReLU and softmax, as well as the MAC operation. The proposed design achieves an improved throughput of up to 16× FxP4, 8× FxP8, 4× FxP16 and 1× FxP32 in pipeline mode with 100% time-multiplexed hardware. This work also proposes an area-efficient multiprecision iterative mode in SIMD systolic arrays for edge AI use cases. The design delivers superior performance with up to 62× and 371× reductions in DMA reads for input feature maps and weight filters in VGG16, with an energy efficiency of 8.42 GOPS/W within an accuracy loss of 2%. The proposed architecture supports emerging 4-bit computations for DL inference while enhancing throughput in FxP8/16 modes for transformers and other HPC applications. The proposed approach enables future energy-efficient AI accelerators in edge and cloud environments.

What problem does this paper attempt to address?

The main problem this paper addresses is that existing hardware accelerators do not support both multiple precisions and runtime-configurable non-linear activation functions (AFs). Specifically:

1. **Simultaneous support for diverse precisions and runtime-configurable activation functions**: Existing solutions support either diverse precisions or runtime-configurable activation functions, but cannot meet both requirements at once. This limits the flexibility and efficiency of hardware accelerators in edge AI and high-performance computing (HPC) workloads.
2. **Improved throughput and resource utilization**: Deep learning inference and training, Vision Transformers (ViTs) and other HPC applications need a hardware design that operates efficiently in different precision modes and supports multiple activation functions. Existing hardware designs have bottlenecks in this regard, especially for large-scale parallel tasks.
3. **Reduced pressure on memory and interconnect bandwidth**: As AI models grow, memory access and data transfer become the main bottlenecks of system performance. A hardware architecture is therefore needed that reduces the number of DMA reads, optimizes memory usage and improves energy efficiency.

### Solutions proposed in the paper

The paper proposes a flexible SIMD multi-precision processing element (Flex-PE) with the following main features:

- **Support for diverse activation functions**: Including common activation functions such as Sigmoid, Tanh, ReLU and Softmax.
- **Multi-precision support**: It can switch at runtime between 4-bit, 8-bit, 16-bit and 32-bit fixed-point precisions to suit different application scenarios.
- **High throughput**: Through time-multiplexing and a pipelined design, throughputs of 16×FxP4, 8×FxP8, 4×FxP16 and 1×FxP32 are achieved.
- **Efficient SIMD architecture**: Using the CORDIC algorithm and a 5-stage SIMD logarithmic barrel shifter (LBS), the design performs flexible multi-precision operations and activation function calculations.

### Main contributions

1. **Flexible multi-precision configurable activation function (AF)**:
   - Supports multiple activation functions (Tanh, ReLU, Softmax, Sigmoid) and multi-precision (FxP4/8/16/32) operations.
   - Achieves almost 100% hardware utilization through time-multiplexing and improves throughput.
2. **Flexible SIMD multi-precision CORDIC-based processing element (Flex-PE)**:
   - Uses the CORDIC method for MAC and multiple activation function operations, providing a configurable and efficient AI core.
   - Has an iterative mode (suited to resource-constrained environments) and a pipeline mode (suited to HPC applications), offering a trade-off between latency and performance.
3. **Performance-enhanced SIMD systolic array analysis**:
   - Evaluates runtime adaptation to different precision levels in edge applications, achieving an energy efficiency of 8.42 GOPS/W with an accuracy loss within 2%.
   - For the VGG-16 model, reduces the number of DMA reads of input feature maps and weight filters by 62 times and 371 times respectively.

These improvements enhance the adaptability and resource efficiency of the hardware and make it better suited to the complex requirements of AI applications.
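The idea of reusing one shift-and-add CORDIC datapath for both MAC and activation functions can be illustrated in software. The Python sketch below uses textbook CORDIC iterations (linear mode for multiply-accumulate and division, hyperbolic mode for tanh and sigmoid). It is not the paper's hardware design: the function names, the 16-bit fractional format (`FRAC_BITS`), the iteration count and the `ACTIVATIONS` dispatch table are all assumptions made for illustration, and softmax is omitted.

```python
import math

FRAC_BITS = 16            # assumed fractional width, not taken from the paper
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fx(value):
    """Convert a float to the assumed fixed-point integer format."""
    return int(round(value * ONE))


def to_float(value):
    """Convert a fixed-point integer back to a float."""
    return value / ONE


def cordic_mac(acc, x, w, n=16):
    """Linear-mode CORDIC (rotation): approximates acc + x*w for |w| < 2."""
    y, z = acc, w
    for i in range(n):
        d = 1 if z >= 0 else -1
        y += d * (x >> i)        # shift-and-add, no multiplier needed
        z -= d * (ONE >> i)      # drive the residual weight toward zero
    return y


def cordic_divide(num, den, n=16):
    """Linear-mode CORDIC (vectoring): approximates num/den, den > 0, |num/den| < 2."""
    z = 0
    for i in range(n):
        d = 1 if num >= 0 else -1
        num -= d * (den >> i)    # drive the numerator toward zero
        z += d * (ONE >> i)      # accumulate the quotient bit by bit
    return z


def hyperbolic_schedule(n):
    """Shift indices 1..n, repeating 4 and 13 as hyperbolic CORDIC requires."""
    seq = []
    for i in range(1, n + 1):
        seq.append(i)
        if i in (4, 13):
            seq.append(i)
    return seq


def cordic_tanh(z, n=16):
    """Hyperbolic CORDIC for sinh/cosh, then a CORDIC division (|z| <~ 1.118)."""
    x, y = ONE, 0                # the common CORDIC gain cancels in sinh/cosh
    for i in hyperbolic_schedule(n):
        d = 1 if z >= 0 else -1
        x, y = x + d * (y >> i), y + d * (x >> i)
        z -= d * to_fx(math.atanh(2.0 ** -i))
    return cordic_divide(y, x, n)


def cordic_sigmoid(x, n=16):
    """sigmoid(x) = 0.5 * (1 + tanh(x / 2)), using shifts for the halvings."""
    return (ONE + cordic_tanh(x >> 1, n)) >> 1


def relu(x):
    """ReLU needs no CORDIC iterations, only a sign check."""
    return x if x > 0 else 0


# Hypothetical runtime selector: one table lookup picks the activation function.
ACTIVATIONS = {"relu": relu, "tanh": cordic_tanh, "sigmoid": cordic_sigmoid}
```

For example, `to_float(ACTIVATIONS["tanh"](to_fx(0.5)))` should be close to `math.tanh(0.5)`. Inputs beyond roughly |z| ≤ 1.118 fall outside the basic hyperbolic CORDIC convergence range and would need range extension or saturation, which this sketch does not implement.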

Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads

FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI

Optimally Approximated and Unbiased Floating-Point Multiplier with Runtime Configurability

A Low-Cost Floating-Point FMA Unit Supporting Package Operations for HPC-AI Applications

Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling

AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU.

Flexible Acceleration Framework for Dense/Sparse Matrix Multiplication on Versal ACAP

A 28nm 1.07TFLOPS/mm² Dynamic-Precision Training Processor with Online Dynamic Execution and Multi-Level-Aligned Block-FP Processing

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

F1 - Striking the Balance Between Energy Efficiency & Flexibility - General-Purpose vs Special-Purpose ML Processors.

Hybrid Precision Floating-Point (HPFP) Selection to Optimize Hardware-Constrained Accelerator for CNN Training

Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks

A 28nm 128TFLOPS/W Computing-In-Memory Engine Supporting One-Shot Floating-Point NN Inference and On-Device Fine-Tuning for Edge AI

FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support

All-rounder: A flexible DNN accelerator with diverse data format support

Optimal Architecture of Floating-Point Arithmetic for Neural Network Training Processors

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit for Analog In-Memory Computing

A Conv‐GEMM reconfigurable accelerator with WS‐RS dataflow for high throughput processing