ReDas: A Lightweight Architecture for Supporting Fine-Grained Reshaping and Multiple Dataflows on Systolic Array

Meng Han, Liang Wang, Limin Xiao, Tianhao Cai, Zeyu Wang, Xiangrong Xu, Chenhao Zhang
DOI: https://doi.org/10.1109/TC.2024.3398500
2024-05-15
Abstract: The systolic accelerator is one of the premier architectural choices for DNN acceleration. However, the conventional systolic architecture suffers from low PE utilization due to the mismatch between the fixed array and diverse DNN workloads. Recent studies have proposed flexible systolic array architectures to adapt to DNN models. However, these designs support only coarse-grained reshaping or significantly increase hardware overhead. In this study, we propose ReDas, a flexible and lightweight systolic array that supports dynamic fine-grained reshaping and multiple dataflows. First, ReDas integrates lightweight and reconfigurable roundabout data paths, which achieve fine-grained reshaping using only short connections between adjacent PEs. Second, we redesign the PE microarchitecture and integrate a set of multi-mode data buffers around the array. The PE structure enables additional data bypassing and flexible data switching. Simultaneously, the multi-mode buffers facilitate fine-grained reallocation of on-chip memory resources, adapting to various dataflow requirements. ReDas can dynamically reconfigure to up to 129 different logical shapes and 3 dataflows for a 128x128 array. Finally, we propose an efficient mapper to generate appropriate configurations for each layer of DNN workloads. Compared to the conventional systolic array, ReDas can achieve about 4.6x speedup and 8.3x energy-delay product (EDP) reduction.
Hardware Architecture
What problem does this paper attempt to address?
This paper addresses the low utilization of processing elements (PEs) in deep neural network (DNN) acceleration, which stems from the fixed shape and single dataflow of the conventional systolic array. Because computational characteristics and layer shapes differ widely across DNN models and across the layers within a model, a fixed systolic array often achieves very low PE utilization on these heterogeneous workloads, especially on specific layer types such as long short-term memory (LSTM) and deep convolutional layers. This underutilization wastes hardware resources and limits the overall performance and energy efficiency of the system.

To overcome this challenge, the paper proposes ReDas, a flexible and lightweight systolic array architecture that supports dynamic fine-grained reshaping and multiple dataflows. ReDas adapts to different DNN models by introducing reconfigurable roundabout (ring-shaped) data paths and multi-mode buffers, and by redesigning the PE microarchitecture. Compared with existing flexible designs, ReDas can provide higher PE utilization and better performance without significantly increasing hardware overhead.

The main contributions of the paper are:

1. **A lightweight, reconfigurable roundabout data path that achieves fine-grained reshaping using only short connections.** Compared with dedicated bypass data paths, the shared roundabout data path scales better and incurs lower overhead.
2. **An efficient systolic array architecture, ReDas.** By allowing data to move along two dimensions, ReDas flexibly supports fine-grained reshaping and multiple dataflows.
3. **A mapping strategy, ReDas Mapper, that adapts to various DNN models.** The mapper uses a detailed analytical model and interval sampling to search for suitable hardware configurations and workload mappings.

With these innovations, ReDas achieves about a 4.6x speedup and an 8.3x reduction in energy-delay product (EDP) over the conventional systolic array across multiple DNN models. This indicates that ReDas offers significant advantages in the flexibility, efficiency, and cost-effectiveness of DNN accelerators.
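To make the utilization problem and the idea of a shape-searching mapper concrete, the following is a minimal Python sketch. It is not the ReDas Mapper: the cost model, the candidate shape set (all factor pairs of the PE count), and the function names `utilization`, `estimate_cycles`, `candidate_shapes`, and `search` are illustrative assumptions, and the `step` parameter only loosely mimics interval sampling by keeping every `step`-th candidate.

```python
from math import ceil


def utilization(m, n, rows, cols):
    """Fraction of PE slots doing useful work when an m x n output
    is tiled onto a rows x cols logical array."""
    tiles = ceil(m / rows) * ceil(n / cols)
    return (m * n) / (tiles * rows * cols)


def estimate_cycles(m, n, k, rows, cols):
    """Toy cost model (an assumption, not the paper's): each tile
    accumulates for k cycles plus rows + cols - 2 cycles of fill/drain."""
    tiles = ceil(m / rows) * ceil(n / cols)
    return tiles * (k + rows + cols - 2)


def candidate_shapes(total_pes, step=1):
    """Hypothetical shape set: every (rows, cols) with rows * cols == total_pes,
    keeping every `step`-th candidate as a stand-in for interval sampling."""
    shapes = [(r, total_pes // r) for r in range(1, total_pes + 1)
              if total_pes % r == 0]
    return shapes[::step]


def search(m, n, k, total_pes, step=1):
    """Return (shape, cycles) with the lowest estimated cycle count."""
    scored = ((shape, estimate_cycles(m, n, k, *shape))
              for shape in candidate_shapes(total_pes, step))
    return min(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # A skinny, LSTM-like matrix multiply: 1 x 1024 times 1024 x 4096.
    m, n, k = 1, 4096, 1024
    print("fixed 128x128 utilization:", utilization(m, n, 128, 128))
    print("best reshaped configuration:", search(m, n, k, 128 * 128))
```

Under this toy model, a skinny workload leaves most rows of a square array idle, while a wide and short logical shape covers the same output in fewer tiles. The real mapper would also have to choose among the three dataflows and account for buffer allocation, which this sketch omits.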