Reduced-Precision Floating-Point Arithmetic in Systolic Arrays with Skewed Pipelines

D. Filippas, C. Peltekis, G. Dimitrakopoulos, C. Nicopoulos
DOI: https://doi.org/10.1109/AICAS57966.2023.10168556
2023-09-08
Abstract: The acceleration of deep-learning kernels in hardware relies on matrix multiplications that are executed efficiently on Systolic Arrays (SA). To effectively trade off deep-learning training/inference quality with hardware cost, SA accelerators employ reduced-precision Floating-Point (FP) arithmetic. In this work, we demonstrate the need for new pipeline organizations to reduce latency and improve energy efficiency of reduced-precision FP operators for the chained multiply-add operation imposed by the structure of the SA. The proposed skewed pipeline design reorganizes the pipelined operation of the FP multiply-add units to enable new forwarding paths for the exponent logic, which allow for parallel execution of the pipeline stages of consecutive PEs. As a result, the latency of the matrix multiplication operation within the SA is significantly reduced with minimal hardware cost, thereby yielding an energy reduction of 8% and 11% for the examined state-of-the-art CNNs.
Hardware Architecture
What problem does this paper attempt to address?
The paper focuses on the floating-point multiply-add operations in neural network hardware accelerators, specifically in systolic array (SA) architectures that employ reduced-precision floating-point (FP) arithmetic. The problems it attempts to solve are:

1. **Reducing Latency**: When a matrix multiplication runs on a systolic array, the multiply-add operations of consecutive Processing Elements (PEs) are chained. The conventional pipeline organization of the FP multiply-add units limits how much the pipeline stages of consecutive PEs can overlap, which increases the latency of the overall computation. The paper proposes a new pipeline organization to reduce this latency.
2. **Improving Energy Efficiency**: Lower latency shortens the time needed to complete a matrix multiplication, which in turn reduces overall energy consumption. The paper reports an energy reduction of 8% and 11% for the examined state-of-the-art CNNs, at minimal hardware cost.
3. **Optimizing Pipeline Structure**: Taking into account the characteristics of reduced-precision FP formats, the paper proposes a reorganized pipeline, the "skewed pipeline" design. It allows the pipeline stages of consecutive PEs to execute in parallel.

In summary, the paper aims for lower latency and higher energy efficiency by reorganizing the pipeline of the FP multiply-add units used in systolic arrays for deep-learning acceleration. It does so by reorganizing the pipelined operation of these units and adding new forwarding paths for the exponent logic, so that the pipeline stages of consecutive PEs no longer have to run strictly one after the other.
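
The latency argument can be illustrated with a toy timing model of one chained reduction through a column of PEs. This is only a sketch under stated assumptions, not the paper's design or its measured numbers: the function name `column_latency`, the two-stage pipeline depth, and the one-cycle issue gap between consecutive PEs in the skewed case are all hypothetical choices made for illustration.

```python
def column_latency(num_pes: int, stages: int, issue_gap: int) -> int:
    """Cycles until the last PE in a chain produces its result.

    num_pes   -- number of chained PEs in one column (assumed parameter)
    stages    -- pipeline depth of each multiply-add unit (assumed parameter)
    issue_gap -- cycles a PE waits after its predecessor starts before it
                 can start itself (assumed parameter)
    """
    if num_pes < 1 or stages < 1 or issue_gap < 1:
        raise ValueError("all parameters must be positive")
    # The first PE starts at cycle 0; each later PE starts issue_gap cycles
    # after the PE before it.
    last_start = (num_pes - 1) * issue_gap
    # The last PE still needs its full pipeline depth to finish.
    return last_start + stages


STAGES = 2  # assumed pipeline depth, chosen only for illustration

for rows in (8, 32, 128):
    # Conventional chaining: a PE waits for the complete result of the
    # previous PE, so the issue gap equals the pipeline depth.
    conventional = column_latency(rows, STAGES, issue_gap=STAGES)
    # Skewed chaining (assumption): forwarding lets the next PE start one
    # cycle after its predecessor, so stages of consecutive PEs overlap.
    skewed = column_latency(rows, STAGES, issue_gap=1)
    print(f"rows={rows}: conventional={conventional}, skewed={skewed}")
```

Under these assumptions the conventional chain costs `num_pes * stages` cycles, while the overlapped chain costs `num_pes + stages - 1`. The model ignores array fill and drain, operand loading, and any clock-period changes, so it shows only why overlapping the stages of consecutive PEs shortens the chain.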