Abstract: Vector accelerators have been widely used in scientific computing. They also show great potential for accelerating convolutional neural networks (CNNs). However, previous general CNN-mapping methods introduce a large amount of intermediate data and additional data conversion, and the resulting memory overhead causes significant performance loss. To address these issues and achieve high computational efficiency, this paper proposes an efficient CNN-mapping method dedicated to vector accelerators. It includes: 1) A data layout method, which establishes a set of efficient data storage and computing models for various CNNs on vector accelerators and achieves high memory access efficiency and high vectorization efficiency. 2) A conversion method, which converts the computation of convolutional layers and fully connected layers into large-scale matrix multiplication, and the computation of pooling layers into row computations on a matrix. All conversions are implemented by extracting rows from a two-dimensional matrix, which gives high data access and transmission efficiency and incurs no additional memory overhead or data conversion. Based on these methods, we design a vectorization mechanism that vectorizes convolutional, pooling, and fully connected layers on a vector accelerator and can be applied to various CNN models. This mechanism takes full advantage of the parallel computing capability of the multi-core vector accelerator and further improves the performance of deep convolutional neural networks. The experimental results show that the average computational efficiency of the convolutional layers and fully connected layers of AlexNet, VGG-19, GoogLeNet, and ResNet-50 is 93.3% and 93.4%, respectively, and the average data access efficiency of the pooling layers is 70%. Compared with NVIDIA inference GPUs, our accelerator achieves a 36.1% performance improvement, which is comparable to NVIDIA V100 GPUs. Compared with Matrix2000, which has a similar architecture, our accelerator achieves a 17-45% improvement in computational efficiency.
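To make the conversion idea concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes a hypothetical layout in which the input is a two-dimensional matrix with one row per (channel, y, x) position and one column per image in the batch; the abstract does not specify the actual layout. Under that assumption, a convolution window is just a selection of rows, a convolutional layer becomes a matrix product of the weight matrix with those rows, and max pooling becomes an element-wise maximum across rows. A fully connected layer is the special case in which the weight matrix multiplies all rows at once. All function names here are illustrative.

```python
# Sketch only: the row-per-position layout and all helper names are
# assumptions made for illustration, not the layout used in the paper.

def row_index(c, y, x, H, W):
    """Row of the 2D input matrix holding element (c, y, x) of every image."""
    return (c * H + y) * W + x

def window_rows(C, H, W, K, oy, ox, stride=1):
    """Row indices covered by the K x K window at output position (oy, ox)."""
    return [row_index(c, oy * stride + ky, ox * stride + kx, H, W)
            for c in range(C) for ky in range(K) for kx in range(K)]

def matmul(A, B):
    """Plain matrix product of A (m x k) and B (k x n), as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def conv_as_matmul(X, Wmat, C, H, W, K, stride=1):
    """Convolution as row extraction followed by matrix multiplication.

    X    : (C*H*W) x N input matrix, one column per image.
    Wmat : M x (C*K*K) weight matrix, one row per output channel.
    Returns a dict mapping (oy, ox) to an M x N result matrix.
    """
    OH = (H - K) // stride + 1
    OW = (W - K) // stride + 1
    out = {}
    for oy in range(OH):
        for ox in range(OW):
            # Selecting rows reuses the stored data; no element is rearranged.
            patch = [X[r] for r in window_rows(C, H, W, K, oy, ox, stride)]
            out[(oy, ox)] = matmul(Wmat, patch)
    return out

def max_pool_rows(X, C, H, W, K, stride):
    """Max pooling as an element-wise maximum across extracted rows.

    Returns a (C*OH*OW) x N matrix with rows ordered by (c, oy, ox).
    """
    OH = (H - K) // stride + 1
    OW = (W - K) // stride + 1
    out = []
    for c in range(C):
        for oy in range(OH):
            for ox in range(OW):
                rows = [X[row_index(c, oy * stride + ky, ox * stride + kx, H, W)]
                        for ky in range(K) for kx in range(K)]
                out.append([max(col) for col in zip(*rows)])
    return out
```

In this sketch the batch dimension runs along each row, so every row operation acts on a contiguous run of values, which is the kind of access pattern a vector unit handles well. Whether the paper vectorizes along the batch dimension is an assumption here.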
