Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices

Francesco Daghero, Alessio Burrello, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari
2024-06-18
Abstract: Depthwise separable convolutions are a fundamental component in efficient Deep Neural Networks, as they reduce the number of parameters and operations compared to traditional convolutions while maintaining comparable accuracy. However, their low data reuse opportunities make deploying them notoriously difficult. In this work, we perform an extensive exploration of alternatives to fuse the depthwise and pointwise kernels that constitute the separable convolutional block. Our approach aims to minimize time-consuming memory transfers by combining different data layouts. When targeting a commercial ultra-low-power device with a three-level memory hierarchy, the GreenWaves GAP8 SoC, we reduce the latency of end-to-end network execution by up to 11.40%. Furthermore, our kernels reduce activation data movements between L2 and L1 memories by up to 52.97%.
Machine Learning; Distributed, Parallel, and Cluster Computing
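The abstract's claim that separable convolutions need fewer parameters and operations can be checked with a short count. The sketch below is illustrative only: the function names are hypothetical, bias terms are ignored, and both the depthwise and the pointwise stage are assumed to produce the same output resolution.

```python
def standard_conv_cost(c_in, c_out, k, h_out, w_out):
    """Return (parameters, MACs) for a standard k x k convolution, no bias."""
    params = c_out * c_in * k * k
    macs = params * h_out * w_out  # every weight is used once per output pixel
    return params, macs


def separable_conv_cost(c_in, c_out, k, h_out, w_out):
    """Return (parameters, MACs) for a depthwise k x k plus pointwise 1 x 1 block."""
    dw_params = c_in * k * k   # one k x k filter per input channel
    pw_params = c_in * c_out   # 1 x 1 filters that mix channels
    params = dw_params + pw_params
    macs = params * h_out * w_out
    return params, macs


if __name__ == "__main__":
    # Assumed layer shape, chosen only for illustration.
    std = standard_conv_cost(32, 64, 3, 16, 16)
    sep = separable_conv_cost(32, 64, 3, 16, 16)
    print("standard :", std)
    print("separable:", sep)
    print("MAC ratio: %.2f" % (std[1] / sep[1]))
```

For this assumed shape the separable block needs roughly 7.9 times fewer multiply-accumulates. Each depthwise weight, however, is applied to a single channel only, which is the low data reuse the abstract refers to.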
What problem does this paper attempt to address?
The paper addresses the problem of accelerating depthwise separable convolutions on ultra-low-power devices. Specifically, it studies how fusing the depthwise (DW) and pointwise (PW) convolutions can reduce memory transfers and so improve the execution efficiency of deep neural networks on these devices. Although depthwise separable convolutions have fewer parameters and require less computation than standard convolutions, they offer few opportunities for data reuse, which makes them hard to deploy efficiently on ultra-low-power hardware. The paper therefore proposes a set of fusion methods that aim to maximize data reuse for each basic operation while minimizing data transfers and reorganizations between memory levels.

The main contributions of the paper are:

1. Six new fused kernels, each using a different combination of data layouts and processing modes. On the GreenWaves GAP8 SoC, the median computational overhead of these kernels is only 5.13% when memory transfers are not considered.

2. An extension of the open-source AI compiler DORY to support the fused kernels, together with an engine that selects which layers to fuse based on graph analysis and predefined constraints. Using these kernels as the backend and targeting GAP8, the paper reduces the end-to-end inference latency of deep neural networks by up to 11.40% while reducing activation memory transfers by up to 27.26%. When the selection instead minimizes the number of transfers, transfers are reduced by up to 52.97% and inference latency by 2.64%.

With these methods, the paper improves the efficiency of depthwise separable convolutions on ultra-low-power devices, in terms of both latency and data movement.
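To make the idea of fusing DW and PW concrete, here is a minimal pure-Python sketch of one possible fusion strategy. It is not one of the paper's six kernels, and it models neither the GAP8 memory hierarchy nor DORY's tiling. The function names, the channel-first `[C][H][W]` layout, and the stride-1, no-padding setting are all assumptions made for illustration. The fused version keeps only a `C`-element buffer per output pixel instead of materializing the full depthwise output.

```python
def depthwise(x, dw):
    """Depthwise k x k convolution, stride 1, no padding. x is [C][H][W]."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(dw[0])
    oh, ow = h - k + 1, w - k + 1
    return [[[sum(x[ch][i + a][j + b] * dw[ch][a][b]
                  for a in range(k) for b in range(k))
              for j in range(ow)] for i in range(oh)] for ch in range(c)]


def pointwise(mid, pw):
    """Pointwise 1 x 1 convolution. mid is [C][H][W], pw is [C_out][C]."""
    c, h, w = len(mid), len(mid[0]), len(mid[0][0])
    return [[[sum(pw[co][ch] * mid[ch][i][j] for ch in range(c))
              for j in range(w)] for i in range(h)] for co in range(len(pw))]


def fused_dw_pw(x, dw, pw):
    """Fused block: for each output pixel, the DW results for all channels
    go into a small buffer that the PW step consumes immediately, so the
    full intermediate tensor is never stored."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(dw[0])
    oh, ow = h - k + 1, w - k + 1
    out = [[[0] * ow for _ in range(oh)] for _ in range(len(pw))]
    for i in range(oh):
        for j in range(ow):
            # C-element intermediate buffer (would live in fast memory).
            buf = [sum(x[ch][i + a][j + b] * dw[ch][a][b]
                       for a in range(k) for b in range(k))
                   for ch in range(c)]
            for co in range(len(pw)):
                out[co][i][j] = sum(pw[co][ch] * buf[ch] for ch in range(c))
    return out
```

Calling `fused_dw_pw(x, dw, pw)` gives the same result as `pointwise(depthwise(x, dw), pw)`. The difference lies only in how much intermediate data must be stored and moved, which is the quantity the paper's kernels target.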