Reduced-Precision Floating-Point Arithmetic in Systolic Arrays with Skewed Pipelines

D. Filippas, C. Peltekis, G. Dimitrakopoulos, C. Nicopoulos

DOI: https://doi.org/10.1109/AICAS57966.2023.10168556

2023-09-08

Abstract: The acceleration of deep-learning kernels in hardware relies on matrix multiplications that are executed efficiently on Systolic Arrays (SA). To effectively trade off deep-learning training/inference quality with hardware cost, SA accelerators employ reduced-precision Floating-Point (FP) arithmetic. In this work, we demonstrate the need for new pipeline organizations to reduce latency and improve energy efficiency of reduced-precision FP operators for the chained multiply-add operation imposed by the structure of the SA. The proposed skewed pipeline design reorganizes the pipelined operation of the FP multiply-add units to enable new forwarding paths for the exponent logic, which allow for parallel execution of the pipeline stages of consecutive PEs. As a result, the latency of the matrix multiplication operation within the SA is significantly reduced with minimal hardware cost, thereby yielding an energy reduction of 8% and 11% for the examined state-of-the-art CNNs.
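To make the "chained multiply-add" concrete, the following is a minimal functional sketch, not the paper's hardware design. It assumes IEEE half precision (fp16) as the reduced-precision format and models one SA column as a loop in which each PE adds its product to the partial sum arriving from the previous PE. The names `to_fp16` and `column_dot_product` are hypothetical, introduced only for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and convert it back.

    Uses the stdlib struct format 'e' (binary16). Values outside the fp16
    range raise OverflowError, so this sketch expects small inputs.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def column_dot_product(weights, activations):
    """Model one systolic-array column as a chain of multiply-adds.

    Assumption for illustration: every PE rounds its outgoing partial sum
    to fp16, so PE i depends on the rounded result of PE i-1.
    """
    partial_sum = 0.0
    for w, a in zip(weights, activations):
        # The product of two fp16 values is exact in double precision.
        product = to_fp16(w) * to_fp16(a)
        # One rounding per PE, applied to the accumulated value.
        partial_sum = to_fp16(partial_sum + product)
    return partial_sum


if __name__ == "__main__":
    print(column_dot_product([1.0, 2.0, 0.5], [0.5, 0.25, 4.0]))
    # The small addend is lost to rounding: fp16 spacing at 2048 is 2.
    print(column_dot_product([2048.0, 1.0], [1.0, 1.0]))
```

The loop-carried dependence on `partial_sum` is the serial chain that the paper's pipeline reorganization targets. The second call shows the accuracy side of the trade-off: an addend smaller than the format's spacing at the current partial sum is rounded away.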

Hardware Architecture

What problem does this paper attempt to address?

The paper addresses the multiply-add operations of floating-point neural-network hardware accelerators, specifically in systolic arrays that use reduced-precision floating-point formats. It targets three problems:

1. **Reducing Latency**: During matrix multiplication in a systolic array, the design of traditional floating-point units limits how much the pipeline stages of consecutive Processing Elements (PEs) can overlap, which increases the latency of the overall computation. The paper proposes a new pipeline organization to reduce this latency.
2. **Improving Energy Efficiency**: Lower latency shortens the time needed to complete a matrix multiplication, which in turn reduces overall energy consumption. The paper reports energy reductions of 8% and 11% for the examined CNNs at minimal hardware cost.
3. **Optimizing Pipeline Structure**: Exploiting the characteristics of reduced-precision floating-point formats, the paper proposes a reorganized pipeline architecture, the "skewed pipeline" design. This design lets the pipeline stages of adjacent PEs execute in parallel, which increases parallelism and computational efficiency.

In summary, the paper aims for lower latency and higher energy efficiency by improving the pipeline architecture of the floating-point multiply-add units used in systolic arrays for deep-learning acceleration. It does so by reorganizing the internal pipeline of the floating-point units and adding new forwarding paths for the exponent logic, which relax the dependencies between the pipeline stages of consecutive PEs.
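The latency argument can be illustrated with a toy cycle-count model. This is a sketch under stated assumptions, not figures or formulas taken from the paper: it assumes each PE's multiply-add is split into `stages_per_pe` pipeline stages, that in the baseline a PE can start only after the previous PE in the chain has finished all of its stages, and that in the skewed organization consecutive PEs can start one cycle apart. The function name and parameters are hypothetical.

```python
def column_latency_cycles(num_pes: int, stages_per_pe: int = 2,
                          skewed: bool = False) -> int:
    """Cycles for one partial sum to traverse a chain of PEs (toy model).

    Baseline assumption: PEs execute back to back, with no overlap.
    Skewed assumption: each PE starts one cycle after its predecessor,
    so only the last PE's remaining stages add to the total.
    """
    if num_pes < 1 or stages_per_pe < 1:
        raise ValueError("num_pes and stages_per_pe must be positive")
    if skewed:
        return num_pes + stages_per_pe - 1
    return num_pes * stages_per_pe


if __name__ == "__main__":
    for rows in (8, 32, 128):
        base = column_latency_cycles(rows)
        skew = column_latency_cycles(rows, skewed=True)
        print(f"{rows:4d} PEs: baseline {base} cycles, skewed {skew} cycles")
```

Under these assumptions the benefit grows with the length of the chain, because the baseline pays the full pipeline depth at every PE while the overlapped version pays it only once.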

A Low Latency High Throughput Multiply-accumulator Unit for Float Point and Integer

Low-Power Data Streaming in Systolic Arrays with Bus-Invert Coding and Zero-Value Clock Gating

Floating-Point Multiply-Add with Approximate Normalization for Low-Cost Matrix Engines

ArrayFlex: A Systolic Array Architecture with Configurable Transparent Pipelining

Addressing the issue of processing element under-utilization in general-purpose systolic deep learning accelerators

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

An Energy Efficient Soft SIMD Microarchitecture and Its Application on Quantized CNNs

Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network Accelerators

A Gradient-Interleaved Scheduler for Energy-Efficient Backpropagation for Training Neural Networks

A Deep Learning Inference Scheme Based on Pipelined Matrix Multiplication Acceleration Design and Non-uniform Quantization

Systolic Array Data Flows for Efficient Matrix Multiplication in Deep Neural Networks

Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization

Rethinking Floating Point Overheads for Mixed Precision DNN Accelerators

Optimizing Stochastic Computing for Low Latency Inference of Convolutional Neural Networks

NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22nm FD-SOI

DeMM: A Decoupled Matrix Multiplication Engine Supporting Relaxed Structured Sparsity

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition

FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training