Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs

Fareed Qararyah, Muhammad Waqar Azhar, Mohammad Ali Maleki, Pedro Trancoso
2024-08-05
Abstract: Depthwise and pointwise convolutions have fewer parameters and perform fewer operations than standard convolutions. As a result, they are increasingly used in compact DNNs, including convolutional neural networks (CNNs) and vision transformers (ViTs). However, they have a lower compute-to-memory-access ratio than standard convolutions, so memory accesses are often their performance bottleneck. This paper explores fusing depthwise and pointwise convolutions to overcome the memory access bottleneck, with a focus on fusing these operators on GPUs. The prior art on GPU-based fusion suffers from one or more of the following: (1) fusing only a convolution with an element-wise operator, or only multiple non-convolutional operators, (2) not explicitly optimizing for memory accesses, (3) not supporting depthwise convolutions. This paper proposes Fused Convolutional Modules (FCMs), a set of novel fused depthwise and pointwise GPU kernels. FCMs significantly reduce the memory accesses of pointwise and depthwise convolutions, improving execution time and energy efficiency. To evaluate the trade-offs associated with fusion, and to determine which convolutions are beneficial to fuse and what the optimal FCM parameters are, we propose FusePlanner. FusePlanner consists of cost models that estimate the memory accesses of depthwise, pointwise, and FCM kernels given GPU characteristics. Our experiments on three GPUs using representative CNNs and ViTs demonstrate that FCMs save up to 83% of the memory accesses and achieve speedups of up to 3.7x compared to cuDNN. Complete model implementations of various CNNs using our modules outperform TVM's, achieving speedups of up to 1.8x and saving up to two-thirds of the energy. FCM and FusePlanner implementations are open source: https://github.com/fqararyah/Fusing_DW_and_PW_on_GPUs
Performance; Hardware Architecture; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the performance bottleneck of Depthwise Separable Convolutions (DSC) on Graphics Processing Units (GPUs). Specifically:

1. **Memory access bottleneck**: Although Depthwise Convolution (DW) and Pointwise Convolution (PW) reduce the number of parameters and the amount of computation, their compute-to-memory-access ratio is low, so memory access becomes the performance bottleneck.
2. **Limitations of existing fusion methods**: Existing fusion methods on GPUs suffer from one or more of the following problems:
   - They fuse convolutions only with non-convolutional operations (such as normalization or non-linear activation).
   - They do not explicitly optimize memory access.
   - They do not support depthwise convolution.

To solve these problems, the paper proposes **Fused Convolutional Modules (FCMs)**, which fuse depthwise and pointwise convolutions on GPUs to reduce memory access and improve execution time and energy efficiency. The paper also proposes **FusePlanner**, a tool for evaluating the trade-offs of fusion and selecting the best fusion scheme.

### Main contributions

1. **Proposing FCMs**: A set of novel GPU kernels that fuse depthwise and pointwise convolutions, reducing global memory access and improving latency and energy efficiency.
2. **Proposing FusePlanner**: A set of cost models that estimate the global memory accesses of depthwise convolution, pointwise convolution, and FCM kernels given GPU characteristics. FusePlanner determines which layers are beneficial to fuse and which implementation parameters to use.
3. **Experimental verification**: Experiments were carried out on three GPUs with representative CNN and ViT models. The results show that FCMs save up to 83% of the memory accesses and achieve speedups of up to 3.7x compared to cuDNN. Complete CNN implementations built from FCMs are also up to 1.8x faster than TVM's and save up to two-thirds of the energy.

Through these improvements, the paper aims to overcome the memory access bottleneck of depthwise separable convolution on GPUs, thereby improving the inference efficiency of deep-learning models.
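The low compute-to-memory-access ratio, and the reason fusion helps, can be illustrated with a back-of-the-envelope count. The sketch below is not FusePlanner's cost model: it assumes an idealized kernel that moves every input, weight, and output element to or from global memory exactly once, with stride 1 and "same" padding so the spatial size is unchanged. All function names and the example layer shape are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    macs: int   # multiply-accumulate operations
    elems: int  # tensor elements moved to or from global memory (idealized)

    @property
    def intensity(self) -> float:
        """Compute-to-memory-access ratio, in MACs per element moved."""
        return self.macs / self.elems


def standard_conv(h: int, w: int, c_in: int, c_out: int, k: int) -> Cost:
    # Reads the input feature map and weights, writes the output feature map.
    fmap_in, fmap_out = h * w * c_in, h * w * c_out
    weights = k * k * c_in * c_out
    return Cost(macs=h * w * c_out * k * k * c_in,
                elems=fmap_in + weights + fmap_out)


def depthwise_conv(h: int, w: int, c: int, k: int) -> Cost:
    # One k x k filter per channel; channel count is unchanged.
    fmap = h * w * c
    return Cost(macs=fmap * k * k, elems=fmap + k * k * c + fmap)


def pointwise_conv(h: int, w: int, c_in: int, c_out: int) -> Cost:
    # A 1 x 1 convolution that mixes channels at each spatial position.
    return Cost(macs=h * w * c_in * c_out,
                elems=h * w * c_in + c_in * c_out + h * w * c_out)


def unfused_dw_pw(h: int, w: int, c_in: int, c_out: int, k: int) -> Cost:
    # Two separate kernels: DW writes its output to global memory, PW reads it back.
    dw = depthwise_conv(h, w, c_in, k)
    pw = pointwise_conv(h, w, c_in, c_out)
    return Cost(dw.macs + pw.macs, dw.elems + pw.elems)


def fused_dw_pw(h: int, w: int, c_in: int, c_out: int, k: int) -> Cost:
    # Assumption: the fused kernel keeps the DW output on chip, so the
    # intermediate feature map is neither written nor re-read, and no
    # computation is repeated.
    unfused = unfused_dw_pw(h, w, c_in, c_out, k)
    intermediate = h * w * c_in
    return Cost(unfused.macs, unfused.elems - 2 * intermediate)


# Hypothetical layer: 56 x 56 feature map, 128 -> 128 channels, 3 x 3 filter.
std = standard_conv(56, 56, 128, 128, 3)
dw = depthwise_conv(56, 56, 128, 3)
pw = pointwise_conv(56, 56, 128, 128)
unfused = unfused_dw_pw(56, 56, 128, 128, 3)
fused = fused_dw_pw(56, 56, 128, 128, 3)
saving = 1 - fused.elems / unfused.elems

for name, cost in [("standard", std), ("depthwise", dw), ("pointwise", pw)]:
    print(f"{name:>10}: {cost.intensity:8.2f} MACs per element")
print(f"fusing DW+PW removes {saving:.1%} of the idealized memory traffic")
```

Under these assumptions the depthwise and pointwise layers perform far fewer MACs per element moved than the standard convolution, and fusion removes the write and re-read of the intermediate feature map. Real kernels tile their work and may reload or recompute data, so this count should not be expected to reproduce the savings reported in the paper, which depend on kernel parameters and GPU characteristics.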