SwapNet: Efficient Swapping for DNN Inference on Edge AI Devices Beyond the Memory Budget

Kun Wang, Jiani Cao, Zimu Zhou, Zhenjiang Li

DOI: https://doi.org/10.1109/TMC.2024.3355764

2024-01-30

Abstract: Executing deep neural networks (DNNs) on edge artificial intelligence (AI) devices enables various autonomous mobile computing applications. However, the memory budget of edge AI devices restricts the number and complexity of DNNs allowed in such applications. Existing solutions, such as model compression or cloud offloading, reduce the memory footprint of DNN inference at the cost of decreased model accuracy or autonomy. To avoid these drawbacks, we divide a DNN into blocks and swap them in and out in order, such that large DNNs can execute within a small memory budget. Nevertheless, naive swapping on edge AI devices induces significant delays due to the redundant memory operations in the DNN development ecosystem for edge AI devices. To this end, we develop SwapNet, an efficient DNN block swapping middleware for edge AI devices. We systematically eliminate the unnecessary memory operations during block swapping while remaining compatible with the deep learning frameworks, GPU backends, and hardware architectures of edge AI devices. We further showcase the utility of SwapNet via a multi-DNN scheduling scheme. Evaluations on eleven DNN inference tasks in three applications demonstrate that SwapNet achieves almost the same latency as the case with sufficient memory even when DNNs demand 2.32x to 5.81x memory beyond the available budget. The design of SwapNet also provides novel and feasible insights for deploying large language models (LLMs) on edge AI devices in the future.

Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing

What problem does this paper attempt to address?

The paper addresses the problem of running deep neural networks (DNNs) on edge artificial intelligence (AI) devices, whose limited memory budgets restrict the number and complexity of the DNNs they can host. Existing solutions, such as model compression or cloud offloading, reduce the memory footprint but cost model accuracy or autonomy. The paper instead partitions a large DNN into blocks and swaps those blocks in and out of memory in order, so that a DNN larger than the memory budget can still execute within it. Naive block swapping on edge AI devices, however, incurs significant delays because of redundant memory operations in the DNN development ecosystem for these devices. To address this, the authors develop SwapNet, a DNN block swapping middleware that systematically eliminates the unnecessary memory operations during swapping while remaining compatible with the deep learning frameworks, GPU backends, and hardware architectures of edge AI devices. They further showcase SwapNet's utility through a multi-DNN scheduling scheme. Evaluations on eleven DNN inference tasks in three applications show that SwapNet achieves almost the same latency as the sufficient-memory case even when the DNNs demand 2.32x to 5.81x more memory than the available budget. The authors also suggest that SwapNet's design offers insights for deploying large language models (LLMs) on edge AI devices in the future. In short, the paper targets efficient execution of large DNNs on memory-limited edge AI devices without the accuracy loss of compression or the reliance on external resources that offloading entails.
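To make the block-swapping idea concrete, here is a minimal Python sketch, not SwapNet's actual implementation. It assumes a model is a plain sequence of layers with known parameter sizes, groups consecutive layers greedily so that each block fits the budget, and then runs the blocks one at a time so that only one block is resident at once. The names `Layer`, `partition_into_blocks`, and `run_with_swapping`, the greedy grouping rule, and the sizes used are all hypothetical illustrations. The sketch models neither activation memory nor the redundant-memory-operation elimination that the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Layer:
    """Hypothetical stand-in for one DNN layer."""
    name: str
    size_mb: float                   # memory needed for this layer's parameters
    fn: Callable[[float], float]     # toy forward function


def partition_into_blocks(layers: List[Layer], budget_mb: float) -> List[List[Layer]]:
    """Greedily group consecutive layers so each block fits within budget_mb."""
    blocks: List[List[Layer]] = []
    current: List[Layer] = []
    used = 0.0
    for layer in layers:
        if layer.size_mb > budget_mb:
            raise ValueError(f"layer {layer.name} alone exceeds the budget")
        # Close the current block if adding this layer would overflow it.
        if current and used + layer.size_mb > budget_mb:
            blocks.append(current)
            current, used = [], 0.0
        current.append(layer)
        used += layer.size_mb
    if current:
        blocks.append(current)
    return blocks


def run_with_swapping(blocks: List[List[Layer]], x: float) -> Tuple[float, float]:
    """Run blocks in order; return the output and the peak resident size in MB."""
    peak = 0.0
    for block in blocks:
        resident = sum(layer.size_mb for layer in block)   # "swap in" this block
        peak = max(peak, resident)
        for layer in block:
            x = layer.fn(x)
        # "Swap out": the block's memory is considered free before the next one.
    return x, peak


if __name__ == "__main__":
    sizes = [40, 30, 50, 20, 60]     # 200 MB in total, against a 100 MB budget
    layers = [Layer(f"layer{i}", s, lambda v: v + 1) for i, s in enumerate(sizes)]
    blocks = partition_into_blocks(layers, budget_mb=100)
    output, peak_mb = run_with_swapping(blocks, 0.0)
    print([[layer.name for layer in block] for block in blocks])
    print(output, peak_mb)
```

In this toy setting the model needs twice the budget in total, yet the peak resident size stays below the budget because blocks are loaded one after another. The cost that the sketch hides is the time spent moving each block into memory, which is the overhead the paper sets out to reduce.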

Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks

DaDianNao: A Machine-Learning Supercomputer

FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices

FusedInf: Efficient Swapping of DNN Models for On-Demand Serverless Inference Services on the Edge

Enabling Large Neural Networks on Tiny Microcontrollers with Swapping

Distributed Assignment With Load Balancing for DNN Inference at the Edge

Accelerating Tensor Swapping in GPUs with Self-Tuning Compression

Memory-efficient Deep Learning Inference with Incremental Weight Loading and Data Layout Reorganization on Edge Systems

An Online Approach for DNN Model Caching and Processor Allocation in Edge Computing

CoEdge: Cooperative DNN Inference With Adaptive Workload Partitioning Over Heterogeneous Edge Devices

A Swap Dominated Tensor Re-Generation Strategy for Training Deep Learning Models

Efficient Memory Management for Deep Neural Net Inference

Dynamic DNN Decomposition for Lossless Synergistic Inference

Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing

Joint Architecture Design and Workload Partitioning for DNN Inference on Industrial IoT Clusters

DNNOff: Offloading DNN-Based Intelligent IoT Applications in Mobile Edge Computing

Adaptive Distributed Convolutional Neural Network Inference at the Network Edge with ADCNN

An Application-oblivious Memory Scheduling System for DNN Accelerators

A DNN inference acceleration algorithm combining model partition and task allocation in heterogeneous edge computing system

Joint Optimization of Data Transfer and Co-Execution for DNN in Edge Computing