What problem does this paper attempt to address?

This paper addresses the low area efficiency and energy efficiency of current general-purpose accelerators when they process tensor operations. Existing general-purpose accelerators such as Vector Processing Units (VPUs), General-Purpose Graphics Processing Units (GPGPUs), and Coarse-Grained Reconfigurable Architectures (CGRAs) can support tensor operations, but they do so with low energy efficiency and area efficiency. The paper attributes this to two causes: poor data reuse, which results in a large number of memory accesses, and insufficient support for tensor operations of different precisions, which leaves hardware resources underutilized.

To solve these problems, the paper proposes a new General Tensor Accelerator (GTA), which aims to improve area efficiency and data reuse for tensor operations by introducing the Multi-Precision Reconfigurable Array (MPRA) and an improved dataflow scheduling strategy. The design of GTA takes into account different computational loads and precisions, and targets higher performance and energy efficiency across application scenarios.

The main contributions of the paper include:

1. Identifying the similarity between matrix multiplication and precision multiplication and, based on this, proposing a classification method for tensor operations.
2. Designing the Multi-Precision Reconfigurable Array (MPRA) and implementing it in a vector architecture, enabling GTA to handle tensor operations with arbitrary computational loads and precisions.
3. Implementing a general tensor scheduling optimization strategy based on dataflow, precision, and array-size adjustment, and analyzing the resulting scheduling space.

According to the evaluation results, compared with a VPU (Ara), a GPGPU (NVIDIA H100), and a CGRA (HyCUBE), GTA improves memory efficiency by 7.76x, 5.35x, and 8.76x respectively, and achieves speedups of 6.45x, 3.39x, and 25.83x respectively.
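One way to read the similarity between matrix multiplication and precision multiplication named in the first contribution is sketched below. This is an illustrative interpretation, not the paper's formulation: a wide unsigned multiply is split into narrow "limbs", the limb-by-limb products form an outer product (the same multiply pattern a small matrix product uses), and the partial products are then accumulated with shifts. The function names `split_limbs` and `multiply_via_limbs`, and the 8-bit limb width, are assumptions made for this example.

```python
def split_limbs(x, limb_bits=8, n_limbs=2):
    """Split an unsigned integer into n_limbs limbs, least significant first."""
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(n_limbs)]


def multiply_via_limbs(a, b, limb_bits=8, n_limbs=2):
    """Multiply two unsigned integers using only limb-sized multiplies.

    Assumes a and b each fit in limb_bits * n_limbs bits.
    """
    a_limbs = split_limbs(a, limb_bits, n_limbs)
    b_limbs = split_limbs(b, limb_bits, n_limbs)

    # Outer product of the two limb vectors: every entry is one
    # low-precision multiply, laid out like a small 2-D array of MACs.
    partial = [[ai * bj for bj in b_limbs] for ai in a_limbs]

    # Accumulate the partial products, weighting entry (i, j) by
    # 2 ** ((i + j) * limb_bits), i.e. summing along anti-diagonals.
    result = 0
    for i in range(n_limbs):
        for j in range(n_limbs):
            result += partial[i][j] << ((i + j) * limb_bits)
    return result


if __name__ == "__main__":
    a, b = 0x1234, 0xABCD
    print(multiply_via_limbs(a, b) == a * b)
```

Under this reading, an array of narrow multipliers could in principle serve either many low-precision products or fewer high-precision ones; whether MPRA is organized this way is not stated in the summary above.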

GTA: a new General Tensor Accelerator with Better Area Efficiency and Data Reuse

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability.

Automatic Generation of Spatial Accelerator for Tensor Algebra

TensorLib - A Spatial Accelerator Generation Framework for Tensor Algebra.

DyGA: A Hardware-Efficient Accelerator with Traffic-Aware Dynamic Scheduling for Graph Convolutional Networks.

TuNao: A High-Performance and Energy-Efficient Reconfigurable Accelerator for Graph Processing

EN-T: Optimizing Tensor Computing Engines Performance via Encoder-Based Methodology

High-Performance Generalized Tensor Operations

Efficient Processing of Sparse Tensor Decomposition via Unified Abstraction and PE-Interactive Architecture

FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning

GPTPU: Accelerating Applications using Edge Tensor Processing Units

Ayaka: A Versatile Transformer Accelerator with Low-Rank Estimation and Heterogeneous Dataflow

Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control

High-Performance Tensor Learning Primitives Using GPU Tensor Cores

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

The Implementation and Optimization of Parallel Linpack on Multi-Core Vector Accelerator

Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV

Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU