Abstract: Modern deep learning applications increasingly push model inference to edge devices for several reasons: shorter latency, less load on the network connecting to the cloud, and better protection of user privacy. The Convolutional Neural Network (CNN) is one of the most widely used model families in these applications. Given the high computational complexity of CNN models, it is favorable to execute them on the integrated GPUs of edge devices, which are ubiquitous and offer more computing power and better energy efficiency than the accompanying CPUs. However, programming integrated GPUs efficiently is challenging because of the variety of their architectures and programming interfaces. This paper proposes an end-to-end solution for executing CNN model inference on integrated GPUs at the edge. It uses a unified IR to represent and optimize vision-specific operators on integrated GPUs from multiple vendors, and it leverages machine learning-based scheduling search schemes to optimize computationally intensive operators such as convolution. Our solution also provides a fallback mechanism for operators that are not suitable or convenient to run on GPUs. The evaluation results suggest that, compared to state-of-the-art solutions backed by vendor-provided high-performance libraries on Intel Graphics, ARM Mali GPU, and Nvidia integrated Maxwell GPU, our solution achieves similar or even better (up to 1.62×) performance on a number of popular image classification and object detection models. In addition, our solution has wider model coverage and is more flexible in embracing new models. Our solution has been adopted in production services in AWS and is open-sourced.

Performance of Convolution Neural Network based on Multiple GPUs with Different Data Communication Models

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Performance Analysis of GPU-Based Convolutional Neural Networks

DaDianNao: A Machine-Learning Supercomputer

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

Implementation and evaluation of deep neural networks (DNN) on mainstream heterogeneous systems

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

HongTu: Scalable Full-Graph GNN Training on Multiple GPUs (via communication-optimized CPU data offloading)

Data-parallel distributed training of very large models beyond GPU capacity

Optimization of GPU Memory Usage for Training Deep Neural Networks

Effect of neural network structure in accelerating performance and accuracy of a convolutional neural network with GPU/TPU for image analytics

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

A Unified CPU-GPU Protocol for GNN Training

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Efficient Use of GPU Memory for Large-Scale Deep Learning Model Training

A Survey of Multi-Tenant Deep Learning Inference on GPU

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters