Abstract: Convolution is a common operation in deep neural networks (DNNs) and is often responsible for performance bottlenecks during training and inference. Existing approaches for accelerating convolution aim to reduce computational complexity. However, these strategies often increase the memory footprint and add extra memory accesses, leaving much room for performance improvement. This paper presents a novel approach to optimizing memory access for convolution operations, specifically targeting GPU execution. Our approach uses two optimization techniques to reduce the number of memory operations for convolutions performed along the width and height dimensions. For convolution along the width dimension, we exploit shuffle instructions to exchange the overlapped columns of the input between threads, reducing the number of memory transactions. For convolution along the height dimension, we multiply each overlapped row of the input with multiple rows of a filter to compute multiple output elements, improving the data locality of row elements. We apply our approach to 2D and multi-channel 2D convolutions on an NVIDIA 2080Ti GPU. For 2D convolution, our approach delivers over 2× faster performance than the state-of-the-art image processing libraries. For multi-channel 2D convolutions, we obtain up to 1.3× speedups over the quickest algorithm of cuDNN.
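The two memory-access ideas in the abstract can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the authors' GPU kernel: the function names (`shuffle_down`, `conv_row_with_shuffle`, `conv2d_row_reuse`) are hypothetical, a Python list stands in for the registers of one warp, and "convolution" is taken to mean the unflipped sliding dot product used by DNN frameworks, with no padding and stride 1.

```python
def conv2d_reference(image, filt):
    """Direct 2D convolution (no padding, stride 1), used as the baseline."""
    H, W = len(image), len(image[0])
    FH, FW = len(filt), len(filt[0])
    out = [[0.0] * (W - FW + 1) for _ in range(H - FH + 1)]
    for y in range(H - FH + 1):
        for x in range(W - FW + 1):
            out[y][x] = sum(image[y + i][x + j] * filt[i][j]
                            for i in range(FH) for j in range(FW))
    return out


def shuffle_down(lane_values, delta):
    """Simulated warp shuffle: lane k reads the register of lane k + delta.

    Assumption: lanes whose source is out of range keep their own value.
    """
    n = len(lane_values)
    return [lane_values[k + delta] if k + delta < n else lane_values[k]
            for k in range(n)]


def conv_row_with_shuffle(row, filt_row):
    """Width dimension: each lane loads one element once, then obtains the
    overlapped neighbouring columns from other lanes instead of from memory."""
    lanes = list(row)                     # one simulated memory load per lane
    FW = len(filt_row)
    acc = [0.0] * len(row)
    for j in range(FW):
        shifted = shuffle_down(lanes, j)  # lane k now holds row[k + j]
        for k in range(len(row)):
            acc[k] += shifted[k] * filt_row[j]
    return acc[:len(row) - FW + 1]        # edge lanes hold invalid results


def conv2d_row_reuse(image, filt):
    """Height dimension: read each input row once and multiply it with every
    filter row it overlaps, accumulating into several output rows."""
    H, W = len(image), len(image[0])
    FH, FW = len(filt), len(filt[0])
    out = [[0.0] * (W - FW + 1) for _ in range(H - FH + 1)]
    row_loads = 0
    for r in range(H):
        row = image[r]                    # the only read of this input row
        row_loads += 1
        for i in range(FH):
            y = r - i                     # output row pairing row r with filter row i
            if 0 <= y < H - FH + 1:
                partial = conv_row_with_shuffle(row, filt[i])
                for x in range(W - FW + 1):
                    out[y][x] += partial[x]
    return out, row_loads
```

Calling `conv2d_row_reuse(image, filt)` on a small integer image returns the same output as `conv2d_reference(image, filt)` while reading each input row once, whereas the direct loop touches an interior row once per overlapping filter row. On a real GPU the shuffle step would correspond to a warp-level intrinsic operating on registers; that mapping is an assumption of this sketch rather than something the abstract specifies.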

Characterizing and Demystifying the Implicit Convolution Algorithm on Commercial Matrix-Multiplication Accelerators

A design framework for processing-in-memory accelerator

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution

Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

7.2 A 12nm Programmable Convolution-Efficient Neural-Processing-Unit Chip Achieving 825TOPS

High Performance Zero-Memory Overhead Direct Convolutions

The Indirect Convolution Algorithm

A Conv‐GEMM reconfigurable accelerator with WS‐RS dataflow for high throughput processing

Parallel GEMM-based convolution for deep learning on multicore RISC-V processors

Optimizing GPU Memory Transactions for Convolution Operations

GPNPU: Enabling Efficient Hardware-Based Direct Convolution with Multi-Precision Support in GPU Tensor Cores

MG3MConv: Multi-Grained Matrix-Multiplication-Mapping Convolution Algorithm toward the SW26010 Processor

A 3D Tiled Low Power Accelerator for Convolutional Neural Network

Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions

CUTE: A scalable CPU-Centric and ultra-utilized tensor engine for convolutions

NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective

A Convolution Neural Network Accelerator Design with Weight Mapping and Pipeline Optimization

Evaluating Low-Memory GEMMs for Convolutional Neural Network Inference on FPGAs