SwapNet: Efficient Swapping for DNN Inference on Edge AI Devices Beyond the Memory Budget

Kun Wang, Jiani Cao, Zimu Zhou, Zhenjiang Li
DOI: https://doi.org/10.1109/TMC.2024.3355764
2024-01-30
Abstract: Executing deep neural networks (DNNs) on edge artificial intelligence (AI) devices enables various autonomous mobile computing applications. However, the memory budget of edge AI devices restricts the number and complexity of DNNs allowed in such applications. Existing solutions, such as model compression or cloud offloading, reduce the memory footprint of DNN inference at the cost of decreased model accuracy or autonomy. To avoid these drawbacks, we divide a DNN into blocks and swap them in and out in order, such that large DNNs can execute within a small memory budget. Nevertheless, naive swapping on edge AI devices induces significant delays due to the redundant memory operations in the DNN development ecosystem for edge AI devices. To this end, we develop SwapNet, an efficient DNN block swapping middleware for edge AI devices. We systematically eliminate the unnecessary memory operations during block swapping while remaining compatible with the deep learning frameworks, GPU backends, and hardware architectures of edge AI devices. We further showcase the utility of SwapNet via a multi-DNN scheduling scheme. Evaluations on eleven DNN inference tasks in three applications demonstrate that SwapNet achieves almost the same latency as the case with sufficient memory even when DNNs demand 2.32x to 5.81x memory beyond the available budget. The design of SwapNet also provides novel and feasible insights for deploying large language models (LLMs) on edge AI devices in the future.
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the problem of executing deep neural networks (DNNs) on edge artificial intelligence (AI) devices, whose limited memory budgets restrict the number and complexity of DNNs that can run. Existing solutions, such as model compression or cloud offloading, reduce the memory footprint but do so at the cost of model accuracy or autonomy. The paper instead divides a large DNN into blocks and swaps them in and out in order, so that a DNN exceeding the memory budget can still execute within it. Naive block swapping on edge AI devices, however, induces significant delays because of redundant memory operations in the DNN development ecosystem for these devices. To address this, the authors develop SwapNet, a DNN block swapping middleware that systematically eliminates the unnecessary memory operations during block swapping while remaining compatible with deep learning frameworks, GPU backends, and hardware architectures of edge AI devices. The paper further showcases SwapNet's utility through a multi-DNN scheduling scheme. Evaluations on eleven DNN inference tasks in three applications show that SwapNet achieves almost the same latency as the case with sufficient memory, even when DNNs demand 2.32x to 5.81x memory beyond the available budget. The authors also note that SwapNet's design offers insights for deploying large language models (LLMs) on edge AI devices in the future. In short, the paper targets efficient execution of large DNNs on memory-limited edge AI devices without the accuracy loss of compression or the loss of autonomy that comes with cloud offloading.
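The block-swapping idea described above can be illustrated with a toy simulation. This is only a sketch of the general idea (split a model into blocks that each fit the budget, then run the blocks in order with one block resident at a time). It is not SwapNet's implementation and does not model the redundant memory operations the paper eliminates. The names `Layer`, `partition`, and `run_with_swapping`, the greedy partitioning rule, and the layer sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Layer:
    name: str
    size_mb: int                    # hypothetical memory needed by the layer
    fn: Callable[[float], float]    # toy stand-in for the layer's computation


def partition(layers: List[Layer], budget_mb: int) -> List[List[Layer]]:
    """Greedily group consecutive layers into blocks that each fit the budget."""
    blocks: List[List[Layer]] = []
    current: List[Layer] = []
    used = 0
    for layer in layers:
        if layer.size_mb > budget_mb:
            raise ValueError(f"{layer.name} alone exceeds the budget")
        if used + layer.size_mb > budget_mb:
            # Current block is full: close it and start a new one.
            blocks.append(current)
            current, used = [], 0
        current.append(layer)
        used += layer.size_mb
    if current:
        blocks.append(current)
    return blocks


def run_with_swapping(
    layers: List[Layer], budget_mb: int, x: float
) -> Tuple[float, int]:
    """Run blocks in order, keeping only one block resident at a time."""
    peak_mb = 0
    for block in partition(layers, budget_mb):
        resident_mb = sum(layer.size_mb for layer in block)  # "swap in"
        peak_mb = max(peak_mb, resident_mb)
        for layer in block:
            x = layer.fn(x)
        resident_mb = 0                                      # "swap out"
    return x, peak_mb


# A toy 180 MB model executed under a 100 MB budget (sizes are made up).
layers = [
    Layer("conv1", 40, lambda v: v + 1),
    Layer("conv2", 50, lambda v: v * 2),
    Layer("fc1", 60, lambda v: v - 3),
    Layer("fc2", 30, lambda v: v * 10),
]
output, peak_mb = run_with_swapping(layers, budget_mb=100, x=1.0)
print(output, peak_mb)
```

In this toy run the model needs 180 MB in total, yet the simulated peak residency stays at 90 MB because the four layers are split into two blocks. A real system would also have to account for the cost of each swap, which is the overhead the paper focuses on reducing.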