Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices

Francesco Daghero, Alessio Burrello, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari

2024-06-18

Abstract: Depthwise separable convolutions are a fundamental component in efficient Deep Neural Networks, as they reduce the number of parameters and operations compared to traditional convolutions while maintaining comparable accuracy. However, their low data reuse opportunities make deploying them notoriously difficult. In this work, we perform an extensive exploration of alternatives to fuse the depthwise and pointwise kernels that constitute the separable convolutional block. Our approach aims to minimize time-consuming memory transfers by combining different data layouts. When targeting a commercial ultra-low-power device with a three-level memory hierarchy, the GreenWaves GAP8 SoC, we reduce the latency of end-to-end network execution by up to 11.40%. Furthermore, our kernels reduce activation data movements between L2 and L1 memories by up to 52.97%.
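The parameter and operation savings mentioned in the abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: the function names and the layer shape are hypothetical, biases are ignored, and stride-1 convolutions with a fixed output size are assumed.

```python
def standard_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and multiply-accumulates (MACs) of a standard KxK convolution."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out  # every weight is used once per output pixel
    return params, macs


def separable_conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a depthwise (KxK) plus pointwise (1x1) block."""
    dw_params = k * k * c_in   # one KxK filter per input channel
    pw_params = c_in * c_out   # 1x1 convolution that mixes channels
    params = dw_params + pw_params
    macs = params * h_out * w_out
    return params, macs


# Hypothetical layer: 32 -> 64 channels, 3x3 filters, 16x16 output map.
std_params, std_macs = standard_conv_cost(32, 64, 3, 16, 16)
sep_params, sep_macs = separable_conv_cost(32, 64, 3, 16, 16)

print(std_params, std_macs)  # 18432 4718592
print(sep_params, sep_macs)  # 2336 598016
print(round(std_params / sep_params, 2))  # 7.89
```

For this example layer the separable block needs roughly 8x fewer parameters and MACs. The flip side, as the abstract notes, is that each loaded value is reused far less, so memory transfers weigh more heavily on latency.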

Machine Learning; Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

This paper addresses the acceleration of depthwise separable convolutions on ultra-low-power devices. Specifically, it studies how fusing the depthwise (DW) and pointwise (PW) convolutions can reduce memory transfers and thereby speed up deep neural network execution on these devices. Although depthwise separable convolutions have fewer parameters and require less computation than standard convolutions, they offer few opportunities for data reuse, which makes them hard to deploy efficiently on ultra-low-power hardware. The paper therefore proposes a set of fusion methods that aim to maximize data reuse for each basic operation while minimizing data transfers and reorganizations between memory levels.

The main contributions of the paper are:

1. Six new fusion kernels, each using a different combination of data layouts and processing modes. On the GreenWaves GAP8 SoC, the median computational overhead of these kernels is only 5.13% when memory transfers are not considered.

2. An extension of the open-source AI compiler DORY that supports the fused kernels and adds an engine that selects the layers to fuse based on graph analysis and predefined constraints. With these kernels as the backend and GAP8 as the target, the paper reduces end-to-end inference latency by up to 11.40% and activation memory transfers by up to 27.26%. When the goal is instead to minimize the number of transfers, transfers are reduced by up to 52.97% and inference latency by 2.64%.

Together, these methods make depthwise separable convolutions more efficient to execute on ultra-low-power devices, improving both performance and energy efficiency.
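To make the idea of fusing DW and PW concrete, here is a minimal NumPy sketch. It is not one of the paper's six kernels and does not model GAP8 or DORY: the function names, the HWC layout, the stride-1 unpadded convolution, and the row-by-row fusion order are all assumptions chosen for illustration. It only shows that a fused loop can produce the same result while holding a single row of depthwise outputs instead of the whole intermediate tensor.

```python
import numpy as np


def depthwise(x, w_dw):
    """Depthwise conv, stride 1, no padding. x: (H, W, C), w_dw: (K, K, C)."""
    h, w, c = x.shape
    k = w_dw.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, c), dtype=x.dtype)
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            # Each channel is filtered independently by its own KxK kernel.
            out[i, j] = np.sum(x[i:i + k, j:j + k, :] * w_dw, axis=(0, 1))
    return out


def unfused(x, w_dw, w_pw):
    """Reference: materialize the full DW output, then apply the 1x1 conv."""
    intermediate = depthwise(x, w_dw)   # (H_out, W_out, C) stored in full
    return intermediate @ w_pw          # w_pw: (C, C_out)


def fused_rowwise(x, w_dw, w_pw):
    """Fused sketch: compute one row of DW outputs and consume it immediately."""
    h, w, c = x.shape
    k = w_dw.shape[0]
    h_out, w_out = h - k + 1, w - k + 1
    out = np.zeros((h_out, w_out, w_pw.shape[1]), dtype=x.dtype)
    for i in range(h_out):
        row = np.zeros((w_out, c), dtype=x.dtype)  # only W_out * C values live
        for j in range(w_out):
            row[j] = np.sum(x[i:i + k, j:j + k, :] * w_dw, axis=(0, 1))
        out[i] = row @ w_pw  # pointwise step applied before moving to the next row
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 7, 4))
w_dw = rng.standard_normal((3, 3, 4))
w_pw = rng.standard_normal((4, 3))
print(np.allclose(unfused(x, w_dw, w_pw), fused_rowwise(x, w_dw, w_pw)))  # True
```

In the unfused version the intermediate tensor holds H_out x W_out x C values, whereas the fused loop keeps only W_out x C at a time. On a device with a small L1 memory, avoiding the round trip of that intermediate tensor through a larger, slower memory level is the kind of saving the summary describes.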
