Abstract: With the rapid development of Deep Learning, more and more applications on the cloud and edge utilize large DNN (Deep Neural Network) models for improved task execution efficiency as well as decision-making quality. Due to memory constraints, models are commonly optimized using compression, pruning, and partitioning algorithms so that they become deployable on resource-constrained devices. As the conditions in the computational platform change dynamically, the deployed optimization algorithms should adapt their solutions accordingly. To evaluate these solutions frequently and in a timely fashion, RMs (Regression Models) are commonly trained to predict the relevant solution quality metrics, such as the resulting DNN module inference latency, which is the focus of this paper. Existing prediction frameworks specify different RM training workflows, but none of them allows flexible configuration of the input parameters (e.g., batch size, device utilization rate) or of the RMs selected for different modules. In this paper, a deep learning module inference latency prediction framework is proposed, which i) hosts a set of customizable input parameters to train multiple different RMs per DNN module (e.g., convolutional layer) with self-generated datasets, and ii) automatically selects a set of trained RMs leading to the highest possible overall prediction accuracy, while keeping the prediction time and space consumption as low as possible. Furthermore, a new RM, namely MEDN (Multi-task Encoder-Decoder Network), is proposed as an alternative solution. Comprehensive experimental results show that MEDN is fast and lightweight, and capable of achieving the highest overall prediction accuracy and R-squared value. The Time/Space-efficient Auto-selection algorithm also improves the overall accuracy by 2.5% and R-squared by 0.39%, compared to the MEDN single-selection scheme.

What problem does this paper attempt to address?

The problem this paper attempts to solve is that, in a dynamic environment, existing deep neural network (DNN) module inference latency prediction frameworks cannot flexibly configure input parameters (such as batch size and device utilization) and lack adaptive optimization for different DNN modules. Specifically:

1. **Limitations of Existing Frameworks**:
   - Although existing prediction frameworks can train regression models (RMs) to predict the inference latency of DNN modules, they do not allow flexible configuration of input parameters such as batch size, floating-point operations (FLOPs), and device utilization.
   - These frameworks also cannot select the most suitable regression model for each DNN module, even though different modules have different structures and computational complexity.
2. **Adaptability Problems in a Dynamic Environment**:
   - In cloud and edge-computing environments, the available memory and network bandwidth of devices change dynamically, which affects the quality of current solutions.
   - Therefore, the model optimization algorithm needs to adjust its solutions according to these environmental changes and frequently evaluate the quality of these solutions.
3. **Requirements for the Prediction Framework**:
   - A flexible prediction framework is required, one that supports training multiple different regression models for each DNN module on an automatically generated dataset.
   - This framework should also be able to automatically select a set of trained regression models that achieves the highest overall prediction accuracy while keeping the prediction time and space consumption as low as possible.

For this purpose, the paper proposes a new DNN module inference latency prediction framework with the following features:

- **Flexibility**: Supports customizing input parameters for each DNN module and training multiple different regression models.
- **Adaptive Selection**: Through an automatic selection algorithm, selects a set of trained regression models that achieves the highest possible prediction accuracy while keeping time and space consumption as low as possible.
- **The Newly Proposed Regression Model MEDN**: Introduces a multi-task encoder-decoder network (MEDN) as an alternative regression model. MEDN is fast and lightweight, and it achieves the highest overall prediction accuracy and R-squared value in the reported experiments.

In summary, this paper aims to solve the problem that existing DNN module inference latency prediction frameworks lack flexibility and adaptive ability in a dynamic environment, thereby improving prediction accuracy and efficiency.
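The per-module selection idea summarized above can be illustrated with a minimal Python sketch. This is not the paper's algorithm, whose details are not given here. Everything in it is an assumption made for illustration: the three stand-in regression models (`mean`, `linear`, `knn`), the single scalar input feature, the relative cost ranks, the accuracy definition (share of predictions within ±10% of the measured latency), the `acc_slack` tolerance, the module names, and the synthetic latency curves.

```python
import random

# Candidate regression models (RMs). Each fit_* returns a predict(x) function.
# These are simple stand-ins, not the RMs evaluated in the paper.
def fit_mean(xs, ys):
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    # Ordinary least squares on one scalar feature.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k=3):
    # k-nearest neighbours; keeps the whole training set, hence the highest cost.
    points = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

# name -> (fit function, hypothetical relative time/space cost rank)
CANDIDATES = {"mean": (fit_mean, 1), "linear": (fit_linear, 2), "knn": (fit_knn, 3)}

def accuracy(pred, xs, ys, rel_tol=0.10):
    # Assumed metric: fraction of predictions within +/-10% of the true latency.
    hits = sum(abs(pred(x) - y) <= rel_tol * abs(y) for x, y in zip(xs, ys))
    return hits / len(ys)

def r_squared(pred, xs, ys):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - pred(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def auto_select(datasets, acc_slack=0.01):
    """Per module: keep RMs within acc_slack of the best accuracy, pick the cheapest."""
    selected = {}
    for module, (xs, ys) in datasets.items():
        split = int(0.8 * len(xs))  # 80% train, 20% hold-out
        scores = {}
        for name, (fit, cost) in CANDIDATES.items():
            pred = fit(xs[:split], ys[:split])
            acc = accuracy(pred, xs[split:], ys[split:])
            r2 = r_squared(pred, xs[split:], ys[split:])
            scores[name] = (acc, r2, cost)
        best_acc = max(acc for acc, _, _ in scores.values())
        eligible = [n for n, s in scores.items() if s[0] >= best_acc - acc_slack]
        winner = min(eligible, key=lambda n: scores[n][2])
        selected[module] = (winner, scores[winner][0], scores[winner][1])
    return selected

def make_dataset(latency_fn, n=200, seed=0):
    # Self-generated dataset: x is a hypothetical scalar feature
    # (for example FLOPs scaled by batch size); noise is +/-1%.
    rng = random.Random(seed)
    xs = [rng.uniform(10.0, 100.0) for _ in range(n)]
    ys = [latency_fn(x) * rng.uniform(0.99, 1.01) for x in xs]
    return xs, ys

datasets = {
    "conv": make_dataset(lambda x: 2.0 * x + 1.0, seed=0),             # synthetic linear latency
    "attention": make_dataset(lambda x: 0.01 * x * x + 5.0, seed=1),  # synthetic quadratic latency
}
selected = auto_select(datasets)
for module, (name, acc, r2) in selected.items():
    print(f"{module}: {name} (accuracy={acc:.2f}, R^2={r2:.3f})")
```

On these synthetic curves, the cheaper linear model should be chosen for the linear `conv` latency, and the costlier k-NN model only where it is needed, for the quadratic `attention` latency. The tolerance `acc_slack` is the knob that trades a small amount of accuracy for lower prediction time and space.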

Towards A Flexible Accuracy-Oriented Deep Learning Module Inference Latency Prediction Framework for Adaptive Optimization Algorithms

Edge Collaborative Learning Acceleration Based on Latency Prediction

Condense: A Framework for Device and Frequency Adaptive Neural Network Models on the Edge

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

Accurate Deep Learning Inference Latency Prediction over Dynamic Running Mobile Devices

nn-METER: Towards Accurate Latency Prediction of DNN Inference on Diverse Edge Devices

nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices

CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs

Energy-Aware Dynamic Neural Inference

Accelerate Intermittent Deep Inference

nn-METER

On Latency Predictors for Neural Architecture Search

Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference

A Progressive Subnetwork Searching Framework for Dynamic Inference

A DNN Optimization Framework with Unlabeled Data for Efficient and Accurate Reconfigurable Hardware Inference

AutoScale: Optimizing Energy Efficiency of End-to-End Edge Inference under Stochastic Variance

Multi-Predict: Few Shot Predictors For Efficient Neural Architecture Search

Fine-Grained Complexity-Driven Latency Predictor in Hardware-Aware Neural Architecture Search using Composite Loss

Dual-module Inference for Efficient Recurrent Neural Networks