Abstract: The systolic accelerator is one of the premier architectural choices for DNN acceleration. However, the conventional systolic architecture suffers from low PE utilization due to the mismatch between the fixed array and diverse DNN workloads. Recent studies have proposed flexible systolic array architectures to adapt to DNN models. However, these designs support only coarse-grained reshaping or significantly increase hardware overhead. In this study, we propose ReDas, a flexible and lightweight systolic array that supports dynamic fine-grained reshaping and multiple dataflows. First, ReDas integrates lightweight and reconfigurable roundabout data paths, which achieve fine-grained reshaping using only short connections between adjacent PEs. Second, we redesign the PE microarchitecture and integrate a set of multi-mode data buffers around the array. The PE structure enables additional data bypassing and flexible data switching. Simultaneously, the multi-mode buffers facilitate fine-grained reallocation of on-chip memory resources, adapting to various dataflow requirements. ReDas can dynamically reconfigure to up to 129 different logical shapes and 3 dataflows for a 128x128 array. Finally, we propose an efficient mapper to generate appropriate configurations for each layer of DNN workloads. Compared to the conventional systolic array, ReDas can achieve about 4.6x speedup and 8.3x energy-delay product (EDP) reduction.

What problem does this paper attempt to address?

The paper addresses the low utilization of processing elements (PEs) in deep neural network (DNN) acceleration, which is caused by the fixed shape and single dataflow of the conventional systolic array architecture. Because computational characteristics and shapes differ significantly between DNN models and between their layers, a fixed systolic array usually achieves very low PE utilization on these heterogeneous workloads, especially on specific layer types such as long short-term memory (LSTM) and deep convolutional layers. This inefficiency wastes hardware resources and limits the overall performance and energy efficiency of the system.

To overcome this challenge, the paper proposes ReDas, a flexible and lightweight systolic array architecture that supports dynamic fine-grained reshaping and multiple dataflows. ReDas adapts to different DNN models by introducing reconfigurable roundabout data paths and multi-mode buffers, and by redesigning the PE microarchitecture. Compared with existing methods, ReDas provides higher PE utilization and better performance without significantly increasing hardware overhead.

The main contributions of the paper include:

1. **Proposing a lightweight and reconfigurable roundabout data path that achieves fine-grained reshaping using short connections**. Compared with dedicated bypass data paths, the shared roundabout data path shows better scalability and lower overhead.
2. **Introducing an efficient systolic array architecture, ReDas**. By allowing data to move along two dimensions, ReDas can flexibly support fine-grained reshaping and multiple dataflows.
3. **Proposing a mapping strategy, ReDas Mapper, to adapt to various DNN models**. This mapper uses a detailed analytical model and interval sampling to search for suitable hardware configurations and workload mappings.

With these innovations, across multiple DNN models ReDas achieves approximately a 4.6-fold speedup and an 8.3-fold reduction in energy-delay product (EDP) compared with the conventional systolic array architecture. This indicates that ReDas has significant advantages in the flexibility, efficiency, and cost-effectiveness of DNN accelerators.
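The shape mismatch behind the low PE utilization can be illustrated with a small Python sketch. It is not the ReDas Mapper or its analytical model: it assumes a simple weight-stationary tiling of a K x N weight matrix, and the names `spatial_utilization`, `candidate_shapes`, and `best_shape` are hypothetical. The candidate set (power-of-two factorizations of the PE count) is also an assumption made for brevity and is not the set of 129 logical shapes the paper reports; the `step` parameter is only a loose stand-in for interval sampling.

```python
from math import ceil


def spatial_utilization(k, n, rows, cols):
    """Fraction of PE slots holding useful weights when a K x N weight
    matrix is tiled onto a rows x cols array (weight-stationary assumption)."""
    tiles = ceil(k / rows) * ceil(n / cols)
    return (k * n) / (tiles * rows * cols)


def candidate_shapes(total_pes, step=1):
    """Hypothetical shape set: every (rows, cols) with rows a power of two
    and rows * cols == total_pes, keeping every `step`-th candidate."""
    shapes = []
    rows = 1
    while rows <= total_pes:
        if total_pes % rows == 0:
            shapes.append((rows, total_pes // rows))
        rows *= 2
    return shapes[::step]


def best_shape(k, n, total_pes, step=1):
    """Pick the sampled logical shape with the highest spatial utilization."""
    return max(
        candidate_shapes(total_pes, step),
        key=lambda shape: spatial_utilization(k, n, *shape),
    )


if __name__ == "__main__":
    k, n = 16, 1024  # a skinny layer, chosen only for illustration
    print(spatial_utilization(k, n, 128, 128))  # fixed square array
    print(best_shape(k, n, 128 * 128))          # reshaped logical array
```

Under these assumptions, a 16 x 1024 weight matrix fills only one eighth of the PE slots on a fixed 128 x 128 array, while a logical 16 x 1024 arrangement of the same 16,384 PEs fills all of them. A real mapper would also have to account for the dataflow, buffer capacity, and pipeline fill and drain time, which this sketch ignores.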

ReDas: A Lightweight Architecture for Supporting Fine-Grained Reshaping and Multiple Dataflows on Systolic Array

ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors

Addressing the issue of processing element under-utilization in general-purpose systolic deep learning accelerators

Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

Generating Systolic Array Accelerators with Reusable Blocks

ReDy: A Novel ReRAM-centric Dynamic Quantization Approach for Energy-efficient CNN Inference

A Reconfigurable Computing-in-Memory Accelerator with Dynamic Group-Based Dataflow and Dual-Input Macro Designs

ONE-SA: Enabling Nonlinear Operations in Systolic Arrays for Efficient and Flexible Neural Network Inference

On the Difficulty of Designing Processor Arrays for Deep Neural Networks

FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training

ARAS: An Adaptive Low-Cost ReRAM-Based Accelerator for DNNs

Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling GEMM Acceleration

A Reduced Architecture for ReRAM-Based Neural Network Accelerator and Its Software Stack

FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture

A Conv‐GEMM reconfigurable accelerator with WS‐RS dataflow for high throughput processing

FORMS: Fine-grained Polarized ReRAM-based In-situ Computation for Mixed-signal DNN Accelerator

DSA-CNN: an fpga-integrated deformable systolic array for convolutional neural network acceleration

Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration

DReAC: A Novel Dynamically Reconfigurable Co-Processor

HReA: an Energy-Efficient Embedded Dynamically Reconfigurable Fabric for 13-Dwarfs Processing

Dynamic Resource Partitioning for Multi-Tenant Systolic Array Based DNN Accelerator