GTA: a new General Tensor Accelerator with Better Area Efficiency and Data Reuse

Chenyang Ai, Lechuan Zhao, Zhijie Huang, Cangyuan Li, Xinan Wang, Ying Wang
2024-05-03
Abstract: Recently, tensor algebra has seen significant applications across various domains. Each operator in tensor algebra has a different computational workload and precision. However, current general accelerators, such as VPUs, GPGPUs, and CGRAs, support tensor operators with low energy and area efficiency. This paper conducts an in-depth exploration of general accelerators for tensor processing.
Hardware Architecture
What problem does this paper attempt to address?
The paper addresses the low area efficiency and energy efficiency of current general-purpose accelerators when they process tensor operations. Existing general-purpose accelerators such as Vector Processing Units (VPUs), General-Purpose Graphics Processing Units (GPGPUs), and Coarse-Grained Reconfigurable Architectures (CGRAs) can support tensor operations, but they do so inefficiently. There are two main reasons: they reuse data poorly, which leads to a large number of memory accesses, and they offer insufficient support for tensor operations of different precisions, which leads to low utilization of hardware resources.

To solve these problems, the paper proposes a new General Tensor Accelerator (GTA), which aims to improve area efficiency and data reuse for tensor operations by introducing a Multi-Precision Reconfigurable Array (MPRA) and an improved dataflow scheduling strategy. The design of GTA accounts for different computational workloads and precisions, with the goal of higher performance and energy efficiency across application scenarios.

The main contributions of the paper are:
1. Identifying the similarity between matrix multiplication and precision multiplication, and using it to propose a classification method for tensor operations.
2. Designing the Multi-Precision Reconfigurable Array (MPRA) and implementing it in a vector architecture, enabling GTA to handle tensor operations with arbitrary computational workloads and precisions.
3. Implementing a general tensor scheduling optimization strategy based on dataflow, precision, and array-size adjustment, and analyzing the scheduling space.
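The similarity named in the first contribution can be made concrete with a small sketch. It is an illustration of the general analogy only, not the paper's MPRA design: the function names, the 8-bit limb width, and the 4-limb operand size are all assumptions chosen for the example. A wide multiplication is split into narrow partial products that are accumulated with shifts, which is the same multiply-accumulate loop structure that a matrix multiplication uses.

```python
def split_limbs(x, limb_bits, n_limbs):
    """Split a non-negative integer into little-endian limbs of limb_bits each."""
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(n_limbs)]


def wide_multiply(a, b, limb_bits=8, n_limbs=4):
    """Multiply two unsigned integers using only limb-sized multiplications.

    Assumes a and b each fit in limb_bits * n_limbs bits (hypothetical sizes).
    """
    a_limbs = split_limbs(a, limb_bits, n_limbs)
    b_limbs = split_limbs(b, limb_bits, n_limbs)
    result = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            # Narrow partial product, weighted by its position and accumulated.
            result += (ai * bj) << ((i + j) * limb_bits)
    return result


def matmul(A, B):
    """Plain matrix multiplication on lists of lists, for comparison."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # Same pattern: narrow multiply followed by accumulate.
                C[i][j] += A[i][k] * B[k][j]
    return C


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    print(wide_multiply(a, b) == a * b)
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Both routines reduce to a grid of small multiplications feeding accumulators, which suggests why one array of narrow multipliers could in principle be reconfigured to serve either a higher-precision product or a larger low-precision matrix product.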
According to the evaluation results, compared with a VPU (Ara), a GPGPU (NVIDIA H100), and a CGRA (HyCUBE), GTA improves memory efficiency by 7.76×, 5.35×, and 8.76× respectively, and achieves speedups of 6.45×, 3.39×, and 25.83× respectively.