Abstract: Vector accelerators have been widely used in scientific computing. They also show great potential for accelerating convolutional neural networks (CNNs). However, previous general CNN-mapping methods introduce a large amount of intermediate data and additional conversion, and the resulting memory overhead causes significant performance loss. To address these issues and achieve high computational efficiency, this paper proposes an efficient CNN-mapping method dedicated to vector accelerators, comprising: 1) A data layout method, which establishes a set of efficient data storage and computing models for various CNNs on vector accelerators and achieves high memory access efficiency and high vectorization efficiency. 2) A conversion method, which converts the computation of convolutional layers and fully connected layers into large-scale matrix multiplication, and the computation of pooling layers into row computations on a matrix. All conversions are implemented by extracting rows from a two-dimensional matrix, with high data access and transmission efficiency and without additional memory overhead or data conversion. Based on these methods, we design a vectorization mechanism that vectorizes convolutional, pooling, and fully connected layers on a vector accelerator and can be applied to various CNN models. This mechanism takes full advantage of the parallel computing capability of the multi-core vector accelerator and further improves the performance of deep convolutional neural networks. Experimental results show that the average computational efficiency of the convolutional layers and fully connected layers of AlexNet, VGG-19, GoogleNet, and ResNet-50 is 93.3% and 93.4%, respectively, and the average data access efficiency of the pooling layers is 70%. Compared to NVIDIA inference GPUs, our accelerator achieves a 36.1% performance improvement, comparable to NVIDIA V100 GPUs. Compared with Matrix2000, which has a similar architecture, our accelerator achieves a 17-45% improvement in computational efficiency.
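
To make the row-extraction idea concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a layout that the abstract does not spell out: each matrix row holds one (channel, y, x) position and each column holds one sample of the batch. All function names (`to_rows`, `window_rows`, `conv_as_matmul`, `max_pool_rows`) are hypothetical. Under that assumption, a convolution window is a set of rows, so a convolutional layer becomes a filter matrix multiplied by the extracted rows, a fully connected layer is a single matrix multiplication over all rows, and pooling is an element-wise maximum over rows.

```python
def to_rows(batch):
    """Flatten N feature maps, each [C][H][W], into a (C*H*W) x N matrix.
    Assumed layout: one row per (channel, y, x), one column per sample."""
    n = len(batch)
    c, h, w = len(batch[0]), len(batch[0][0]), len(batch[0][0][0])
    rows = [[batch[i][ch][y][x] for i in range(n)]
            for ch in range(c) for y in range(h) for x in range(w)]
    return rows, (c, h, w)


def window_rows(shape, oy, ox, k, stride):
    """Row indices covered by the k x k window of output position (oy, ox)."""
    c, h, w = shape
    return [(ch * h + oy * stride + dy) * w + ox * stride + dx
            for ch in range(c) for dy in range(k) for dx in range(k)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv_as_matmul(rows, shape, weights, k, stride=1):
    """Convolution (no padding) as matrix multiplication on extracted rows.
    weights is an M x (C*k*k) matrix with one row per filter."""
    c, h, w = shape
    oh, ow = (h - k) // stride + 1, (w - k) // stride + 1
    m = len(weights)
    out = [None] * (m * oh * ow)
    for oy in range(oh):
        for ox in range(ow):
            # Gather references to existing rows; no element is rearranged.
            patch = [rows[r] for r in window_rows(shape, oy, ox, k, stride)]
            block = matmul(weights, patch)  # M x N result for this position
            for f in range(m):
                # Store in the same row layout so the next layer can reuse it.
                out[(f * oh + oy) * ow + ox] = block[f]
    return out, (m, oh, ow)


def max_pool_rows(rows, shape, k, stride):
    """Max pooling as an element-wise maximum over the rows of each window."""
    c, h, w = shape
    oh, ow = (h - k) // stride + 1, (w - k) // stride + 1
    out = []
    for ch in range(c):
        for oy in range(oh):
            for ox in range(ow):
                idx = [(ch * h + oy * stride + dy) * w + ox * stride + dx
                       for dy in range(k) for dx in range(k)]
                out.append([max(col) for col in zip(*(rows[r] for r in idx))])
    return out, (c, oh, ow)
```

With this layout, a fully connected layer needs no extraction at all: `matmul(fc_weights, rows)` with an M x (C*H*W) weight matrix gives the M x N output directly. Because the batch dimension runs along each row, every row operation works on contiguous per-sample values, which is the kind of access pattern a vector unit can exploit.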

A Precision-Scalable Energy-Efficient Convolutional Neural Network Accelerator

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

Sensitivity-Oriented Layer-Wise Acceleration and Compression for Convolutional Neural Network

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

A High-Performance Reconfigurable Accelerator for Convolutional Neural Networks

An Energy-Efficient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks

Convolution Without Multiplication: A General Speed Up Strategy for CNNs

A Low-Power Sparse Convolutional Neural Network Accelerator with Pre-Encoding Radix-4 Booth Multiplier

A Fine-Grained Sparse Accelerator for Multi-Precision DNN

A High-Efficient and Configurable Hardware Accelerator for Convolutional Neural Network

A High Efficient Architecture for Convolution Neural Network Accelerator

A Mixed-Pruning Based Framework for Embedded Convolutional Neural Network Acceleration

A Power-Efficient Accelerator for Convolutional Neural Networks

High-performance Convolutional Neural Network Accelerator Based on Systolic Arrays and Quantization

Efficient Deep Convolutional Neural Networks Accelerator Without Multiplication and Retraining

Layer-Wise Mixed-Modes CNN Processing Architecture With Double-Stationary Dataflow and Dimension-Reshape Strategy

High Energy Efficiency FPGA-Based Accelerator for Convolutional Neural Networks Using Weight Combination

Optimizing Convolutional Neural Networks on Multi-Core Vector Accelerator

An Efficient CNN Inference Accelerator Based on Intra- and Inter-Channel Feature Map Compression

An Efficient FPGA-based Depthwise Separable Convolutional Neural Network Accelerator with Hardware Pruning