Abstract: Edge devices are seeing tremendous growth in sensing and computational capabilities. By running state-of-the-art deep neural network (DNN) based data processing on multi-core CPUs, embedded Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), Neural Processing Units (NPUs), Deep Learning Accelerators (DLAs), etc., edge devices can now handle heavy data computations with limited or no cloud connectivity. In addition to hardware resources, software frameworks play an important role in making machine learning at the edge (EdgeML) a reality: they optimize a trained neural network (NN) model through weight clustering and pruning, quantization of weights, inputs, and outputs to fewer bits, fusion of NN layers, etc., for more efficient execution of NN inference on edge platforms. This paper is a first effort at characterizing these software frameworks for DNN inference optimization on edge devices, especially edge GPUs, which are now ubiquitous in embedded deep learning systems. The interactions between software optimizations and the underlying GPU hardware are carefully examined. As most NN optimization engines are proprietary software whose internal details are not documented in the public domain, our empirical analysis on real embedded GPU platforms, using a variety of widely used DNNs, provides various interesting findings. We observe tremendous performance gain and non-negligible accuracy gain from the software optimizations, but also find highly unexpected non-deterministic behaviors, such as different outputs on the same inputs or increased execution latency for the same NN model on more powerful hardware platforms. Application developers using these proprietary software optimization engines would benefit from our analysis and the discussed implications of our findings, with examples from real applications like intelligent traffic intersection control and Advanced Driving Assistance Systems (ADAS).
Our findings also have important implications for performance modeling and prediction research, which focuses on predicting application performance through micro-architecture modeling but should now additionally consider the optimization engines that this paper examines.

EINNET: Optimizing Tensor Programs with Derivation-Based Transformations

OLLIE: Derivation-based Tensor Program Optimizer

Optimizing DNNs with Partially Equivalent Transformations and Automated Corrections

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Ansor: Generating High-Performance Tensor Programs for Deep Learning

Optimizing Tensor Computation Graphs with Equality Saturation and Monte Carlo Tree Search

conv_einsum: A Framework for Representation and Fast Evaluation of Multilinear Operations in Convolutional Tensorial Neural Networks

EN-T: Optimizing Tensor Computing Engines Performance via Encoder-Based Methodology

ATFormer: A Learned Performance Model with Transfer Learning Across Devices for Deep Learning Tensor Programs

Tensorized NeuroEvolution of Augmenting Topologies for GPU Acceleration

DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism

Uncovering Nested Data Parallelism and Data Reuse in DNN Computation with FractalTensor

Energon: Toward Efficient Acceleration of Transformers Using Dynamic Sparse Attention

UNIT: Unifying Tensorized Instruction Compilation

DUET: Boosting Deep Neural Network Efficiency on Dual-Module Architecture

Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices

Extra dimension algorithm: a breakthrough for optimization and enhancing DNN efficiency