Abstract: Edge devices are seeing tremendous growth in sensing and computational capabilities. Running state-of-the-art deep neural network (DNN) based data processing on multi-core CPUs, embedded Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), Neural Processing Units (NPUs), Deep Learning Accelerators (DLAs), etc., edge devices can now handle heavy data computations with limited or no cloud connectivity. In addition to hardware resources, software frameworks play an important role in making machine learning at the edge (EdgeML) a reality: they optimize a trained neural network (NN) model through weight clustering and pruning, weight and input-output quantization to fewer bits, NN layer fusion, etc., for more efficient execution of NN inference on edge platforms. This paper is a first effort to characterize these software frameworks for DNN inference optimization on edge devices, especially edge GPUs, which are now ubiquitous in embedded deep learning systems. The interactions between the software optimizations and the underlying GPU hardware are carefully examined. As most NN optimization engines are proprietary software whose internal details are not documented in the public domain, our empirical analysis on real embedded GPU platforms, using a variety of widely used DNNs, provides various interesting findings. We observe tremendous performance gain and non-negligible accuracy gain from the software optimizations, but also find highly unexpected non-deterministic behaviors, such as different outputs on the same inputs or increased execution latency for the same NN model on more powerful hardware platforms. Application developers using these proprietary software optimization engines would benefit from our analysis and the discussed implications of our findings, with examples from real applications such as intelligent traffic intersection control and Advanced Driving Assistance Systems (ADAS).
Our findings also have important implications for performance modeling and prediction research, which focuses on application performance prediction based on micro-architecture modeling but should now additionally consider the optimization engines that this paper examines.
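As an illustration of one optimization named in the abstract, quantizing weights to fewer bits, the following is a minimal sketch of symmetric per-tensor 8-bit quantization. It is not taken from the paper and does not describe the internals of any particular optimization engine, which the abstract notes are largely undocumented. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Assumption for illustration: a single scale maps the largest absolute
    weight to 127. Real engines may use per-channel scales or calibration.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate float weights."""
    return [v * scale for v in quantized]


# Hypothetical weights, chosen only to show the rounding error.
weights = [0.82, -1.27, 0.003, 0.5, -0.41]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is one simple way to see why fewer bits trade memory and latency against output fidelity.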

EdgeNN: Efficient Neural Network Inference for CPU-GPU Integrated Edge Devices

Condense: A Framework for Device and Frequency Adaptive Neural Network Models on the Edge

Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing

CoEdge: Cooperative DNN Inference With Adaptive Workload Partitioning Over Heterogeneous Edge Devices

Benchmarking Edge AI Platforms for High-Performance ML Inference

EdgeSP: Scalable Multi-device Parallel DNN Inference on Heterogeneous Edge Clusters

Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy

DistrEdge: Speeding up Convolutional Neural Network Inference on Distributed Edge Devices

EdgeCI: Distributed Workload Assignment and Model Partitioning for CNN Inference on Edge Clusters

Accelerating Deep Neural Network Tasks Through Edge-Device Adaptive Inference

Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations

Boosting DNN Cold Inference on Edge Devices

Demystifying TensorRT: Characterizing Neural Network Inference Engine on Nvidia Edge Devices

Online Scheduling of CPU-NPU Co-inference for Edge AI Tasks

Design and Implementation of Deep Neural Network for Edge Computing

Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence

OnceNAS: Discovering Efficient On-Device Inference Neural Networks for Edge Devices

Edge Devices Inference Performance Comparison

Optimization and Deployment of DNNs for RISC-V-based Edge AI

EdgeKE: An On-Demand Deep Learning IoT System for Cognitive Big Data on Industrial Edge Devices