Abstract: Depthwise and pointwise convolutions have fewer parameters and perform fewer operations than standard convolutions. As a result, they are increasingly used in compact DNNs, including convolutional neural networks (CNNs) and vision transformers (ViTs). However, they have a lower compute-to-memory-access ratio than standard convolutions, so memory access is often their performance bottleneck. This paper explores fusing depthwise and pointwise convolutions to overcome the memory access bottleneck, with a focus on fusing these operators on GPUs. The prior art on GPU-based fusion suffers from one or more of the following: (1) fusing only a convolution with an element-wise operator, or only multiple non-convolutional operators, (2) not explicitly optimizing for memory accesses, (3) not supporting depthwise convolutions. This paper proposes Fused Convolutional Modules (FCMs), a set of novel fused depthwise and pointwise GPU kernels. FCMs significantly reduce the memory accesses of pointwise and depthwise convolutions, improving execution time and energy efficiency. To evaluate the trade-offs associated with fusion, and to determine which convolutions are beneficial to fuse and the optimal FCM parameters, we propose FusePlanner. FusePlanner consists of cost models that estimate the memory accesses of depthwise, pointwise, and FCM kernels given GPU characteristics. Our experiments on three GPUs using representative CNNs and ViTs demonstrate that FCMs save up to 83% of the memory accesses and achieve speedups of up to 3.7x compared to cuDNN. Complete model implementations of various CNNs using our modules outperform TVM's, achieving speedups of up to 1.8x and saving up to two-thirds of the energy. FCM and FusePlanner implementations are open source: https://github.com/fqararyah/Fusing_DW_and_PW_on_GPUs
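
The compute-to-memory-access claim in the abstract can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not taken from the paper: the function names `conv_counts` and `intensity`, the layer shape, and the idealized traffic model (every input, weight, and output element crosses global memory exactly once, stride 1, same padding) are all assumptions made for illustration.

```python
def conv_counts(kind, h, w, c_in, c_out, k=3):
    """Return (macs, elements_moved) for one stride-1, same-padded layer.

    Assumption: ideal traffic, i.e. every input, weight, and output
    element crosses global memory exactly once.
    """
    if kind == "standard":
        macs = h * w * c_in * c_out * k * k
        weights = c_in * c_out * k * k
    elif kind == "depthwise":
        c_out = c_in  # one k x k filter per channel, no channel mixing
        macs = h * w * c_in * k * k
        weights = c_in * k * k
    elif kind == "pointwise":
        macs = h * w * c_in * c_out  # 1 x 1 filters, k is ignored
        weights = c_in * c_out
    else:
        raise ValueError(f"unknown convolution kind: {kind}")
    # Read the input once, read the weights once, write the output once.
    moved = h * w * c_in + weights + h * w * c_out
    return macs, moved


def intensity(kind, h, w, c_in, c_out, k=3):
    """Multiply-accumulates per element moved (higher = more compute-bound)."""
    macs, moved = conv_counts(kind, h, w, c_in, c_out, k)
    return macs / moved


if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 feature map, 128 input and output channels.
    for kind in ("standard", "depthwise", "pointwise"):
        print(kind, round(intensity(kind, 56, 56, 128, 128), 2))
```

Under these assumptions the standard convolution performs far more arithmetic per element moved than either the pointwise or the depthwise layer, which is the imbalance the abstract describes.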

What problem does this paper attempt to address?

This paper addresses the performance bottleneck of depthwise separable convolutions (DSC) on graphics processing units (GPUs). Specifically:

1. **Memory access bottleneck**: Although depthwise convolution (DW) and pointwise convolution (PW) reduce the number of parameters and the amount of computation, their compute-to-memory-access ratio is low, so memory access becomes the performance bottleneck.
2. **Limitations of existing fusion methods**: Existing fusion methods on GPUs have the following problems:
   - They only consider fusing convolutions with non-convolutional operations (such as normalization or non-linear activation).
   - They do not explicitly optimize memory access.
   - They do not support depthwise convolution.

To solve these problems, the paper proposes **Fused Convolutional Modules (FCMs)**, which fuse depthwise and pointwise convolutions on GPUs to reduce memory access and improve execution time and energy efficiency. The paper also proposes a tool named **FusePlanner**, which evaluates the trade-offs of different fusion strategies and selects the best fusion scheme.

### Main contributions

1. **Proposing FCMs**: A set of novel GPU kernels that fuse depthwise and pointwise convolutions, reducing global memory access and improving latency and energy efficiency.
2. **Proposing FusePlanner**: A set of cost models that estimate the global memory accesses of depthwise, pointwise, and FCM kernels. FusePlanner determines which layers are worth fusing and which implementation parameters to use.
3. **Experimental verification**: Experiments on three GPUs with representative CNN and ViT models show that FCMs save up to 83% of memory accesses and achieve speedups of up to 3.7x compared to cuDNN. Complete CNN implementations built from FCMs are also up to 1.8x faster than TVM-optimized models and save up to two-thirds of the energy.

Through these improvements, the paper aims to overcome the memory access bottleneck of depthwise separable convolution on GPUs and thereby improve the inference efficiency of deep learning models.
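
To make the benefit of fusion concrete, the following is a minimal sketch of an idealized global-memory traffic count for a depthwise layer followed by a pointwise layer, with and without fusion. It is not FusePlanner's cost model, which the summary above says also takes GPU characteristics into account. The names `dw_pw_traffic` and `fusion_saving`, the layer shape, and the assumptions (stride 1, same padding, no redundant loads at tile boundaries) are hypothetical and chosen only for illustration.

```python
def dw_pw_traffic(h, w, c, c_out, k=3, fused=False):
    """Elements moved through global memory for a DW layer followed by a PW layer.

    Assumptions (illustrative only): stride 1, same padding, and every
    tensor element is loaded or stored exactly once per kernel.
    """
    feature_map = h * w * c        # DW input; the DW output has the same size
    dw_weights = c * k * k         # one k x k filter per channel
    pw_weights = c * c_out         # 1 x 1 filters mixing channels
    output = h * w * c_out         # final PW output

    traffic = feature_map + dw_weights + pw_weights + output
    if not fused:
        # Unfused: the DW kernel writes the intermediate feature map to
        # global memory and the PW kernel reads it back.
        traffic += 2 * feature_map
    return traffic


def fusion_saving(h, w, c, c_out, k=3):
    """Fraction of global-memory traffic removed by keeping the
    intermediate feature map on chip."""
    unfused = dw_pw_traffic(h, w, c, c_out, k, fused=False)
    fused = dw_pw_traffic(h, w, c, c_out, k, fused=True)
    return (unfused - fused) / unfused


if __name__ == "__main__":
    # Hypothetical layer: 56 x 56 feature map, 128 channels in and out.
    print(f"saving: {fusion_saving(56, 56, 128, 128):.1%}")
```

In this idealized model the saving comes entirely from not writing and re-reading the intermediate feature map. A real fused kernel may have to reload or recompute data at tile boundaries, which is the kind of trade-off a planner has to weigh when deciding whether a given pair of layers is worth fusing.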

Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs

A Convolutional Neural Network Accelerator Architecture with Fine-Granular Mixed Precision Configurability

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs

Optimus: An Operator Fusion Framework for Deep Neural Networks

FuseKNA: Fused Kernel Convolution based Accelerator for Deep Neural Networks

Enabling Efficient Fast Convolution Algorithms on GPUs Via MegaKernels

DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion

Evolutionary bin packing for memory-efficient dataflow inference acceleration on FPGA

A flexible FPGA accelerator for convolutional neural networks

fuseGNN: Accelerating Graph Convolutional Neural Network Training on GPGPU

Mixed-TD: Efficient Neural Network Accelerator with Layer-Specific Tensor Decomposition

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

Evaluating Low-Memory GEMMs for Convolutional Neural Network Inference on FPGAs

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs

A Data-Center FPGA Acceleration Platform for Convolutional Neural Networks

Relative Indexed Compressed Sparse Filter Encoding Format for Hardware-Oriented Acceleration of Deep Convolutional Neural Networks

Efficient Hardware Architectures for Deep Convolutional Neural Network

A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator

Efficient Inference of Large-Scale and Lightweight Convolutional Neural Networks on FPGA

FusionArch: A Fusion-Based Accelerator for Point-Based Point Cloud Neural Networks