CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs

Hanpeng Hu, Junwei Su, Juntao Zhao, Yanghua Peng, Yibo Zhu, Haibin Lin, Chuan Wu

DOI: https://doi.org/10.1145/3627703.3629572

2023-11-17

Abstract: Deep Neural Networks (DNNs) have shown excellent performance in a wide range of machine learning applications. Knowing the latency of running a DNN model or tensor program on a specific device is useful in various tasks, such as DNN graph- or tensor-level optimization and device selection. Considering the large space of DNN models and devices that impede direct profiling of all combinations, recent efforts focus on building a predictor to model the performance of DNN models on different devices. However, none of the existing attempts have achieved a cost model that can accurately predict the performance of various tensor programs while supporting both training and inference accelerators. We propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative but efficient representation of tensor programs, called compact ASTs, and a pre-order-based positional encoding method, to capture the internal structure of tensor programs. We develop a domain-adaption-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm, for the predictor to learn from different domains (i.e., different DNN operators and devices). Our extensive experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines with 14.03% and 10.85% prediction error for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp.
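The abstract reports prediction errors as percentages but does not name the metric. As a minimal sketch, assuming the error is a mean absolute percentage error (MAPE) between predicted and measured latencies, it could be computed as below. The function name `mape` and the sample latencies are hypothetical and not taken from the paper.

```python
from typing import Sequence


def mape(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumption: this is one common way to report latency prediction error;
    the abstract does not state which metric the authors use.
    """
    if len(predicted) != len(measured) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the relative deviation of each prediction from its measurement.
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Hypothetical latencies in milliseconds for three tensor programs.
measured_ms = [2.0, 4.0, 10.0]
predicted_ms = [2.2, 3.6, 10.0]
error = mape(predicted_ms, measured_ms)
print(f"MAPE: {error:.2f}%")
```

Under this reading, a 14.03% error would mean that predictions deviate from measured latency by about 14% on average.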

Machine Learning, Performance

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the issue of predicting the execution latency of tensor programs in various deep neural network (DNN) models across different devices. Specifically:

1. **Cross-Model Performance Prediction (CMPP)**: Modeling the performance of tensor programs in different DNN models on a specific device and predicting the execution time of unseen tensor programs.
2. **Cross-Device Performance Prediction (CDPP)**: Predicting the execution time of a tensor program on a target device based on performance knowledge from other devices.

To solve these two problems, the authors propose the CDMPP framework, which efficiently predicts the absolute execution latency of tensor programs from different DNN models and devices, including training accelerators and inference accelerators. The main contributions include:

- Proposing a concise and easy-to-train tensor program representation method, Compact Abstract Syntax Trees (Compact ASTs), to capture the internal structure of tensor programs.
- Introducing a pre-order position encoding method to handle Compact ASTs.
- Designing a domain-adaptive method to learn invariant representations and proposing a K-means clustering-based sampling algorithm to guide performance evaluation on the target device.
- Implementing a replayer to predict end-to-end DNN performance by estimating the latency of each tensor program from the bottom up.

Experimental results show that CDMPP significantly outperforms existing baseline methods in cross-model and cross-device tensor program latency prediction, with prediction errors of 14.03% and 10.85%, respectively, and improves training efficiency by nearly an order of magnitude.
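To make the idea of a pre-order position concrete, the sketch below assigns each node of a small tree an index in pre-order (parent first, then children left to right). It only illustrates the traversal order. The `Node` class, the toy loop nest, and the use of a plain integer index are assumptions for illustration, and the paper's actual Compact AST format and encoding function may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical AST node: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def preorder_positions(root: Node) -> List[Tuple[int, str]]:
    """Return (position, label) pairs in pre-order."""
    out: List[Tuple[int, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        out.append((len(out), node.label))
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return out


# Toy loop nest, not taken from the paper.
tree = Node("for i", [
    Node("for j", [Node("compute")]),
    Node("store"),
])
positions = preorder_positions(tree)
for pos, label in positions:
    print(pos, label)
```

Each node's position could then be fed to a positional encoding so that a sequence model sees where the node sits in the tree. How CDMPP maps positions to encodings is not specified in this summary.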


Edge Collaborative Learning Acceleration Based on Latency Prediction

nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

nn-METER: Towards Accurate Latency Prediction of DNN Inference on Diverse Edge Devices.

Machine Learning-enabled Performance Model for DNN Applications and AI Accelerator

Towards A Flexible Accuracy-Oriented Deep Learning Module Inference Latency Prediction Framework for Adaptive Optimization Algorithms

nn-METER

Accurate Deep Learning Inference Latency Prediction over Dynamic Running Mobile Devices

DNN-Chip Predictor: An Analytical Performance Predictor for DNN Accelerators with Various Dataflows and Hardware Architectures

DNNAbacus: Toward Accurate Computational Cost Prediction for Deep Neural Networks

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

Ace-Sniper: Cloud-Edge Collaborative Scheduling Framework With DNN Inference Latency Modeling on Heterogeneous Devices

CAP: Communication-aware Automated Parallelization for Deep Learning Inference on CMP Architectures

On Latency Predictors for Neural Architecture Search

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network.

CP: Hierarchical Cross-Platform Power/Performance Prediction Using a Transfer Learning Approach.

NNLQP: A Multi-Platform Neural Network Latency Query and Prediction System with An Evolving Database

Machine Learning Enabled Scalable Performance Prediction of Scientific Codes

Towards CPU Performance Prediction: New Challenge Benchmark Dataset and Novel Approach