Abstract: Machine Learning Computers (MLCs) with tensor functional units (e.g., NVIDIA’s Tensor Core, Google’s TPU, and Habana’s Tensor Processor Core) have proliferated in recent years. The broad diversity of MLCs makes it hard to deploy machine learning workloads with optimized performance. Although deep learning compilers (e.g., TVM) are effective at producing optimized code for different hardware back-ends, deploying to a new MLC still requires tedious implementation of platform-specific compilation optimizations, which in turn demands a thorough understanding of system and architectural details. To address this problem, we propose a holistic approach to achieve one-size-fits-all compilation optimization for inference across different MLCs. The key observation is that diverse MLCs share several key architectural characteristics for tensor processing (e.g., tensor primitives and on-chip scratchpad memory), which can be generalized to conduct cross-platform compilation optimizations. Concretely, we propose the Tensor Abstract Machine (TAM), which features these common architectural characteristics, as an abstraction of a broad range of MLCs. To leverage the architectural characteristics of the TAM, we propose the Tensor Scheduling Language (TSL), which consists of a tensor computation description and tensor scheduling primitives for implementing operations with portable optimizations. Once tensor operations are implemented in TSL, the corresponding optimized code for different MLCs can be generated automatically. To validate our proposal, we conduct experiments on three commodity MLCs: a GPU with Tensor Cores, VTA (on FPGA), and Cloud TPU. Experimental results demonstrate that the code generated from the same optimization schedule achieves 1.05x to 2.05x better performance than hand-tuned libraries and deep learning compilers across different platforms.
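The two shared characteristics named in the abstract, a fixed-shape tensor primitive and an on-chip scratchpad, can be illustrated with a minimal pure-Python sketch of a tiled matrix multiply. This is not TSL syntax and not the authors' implementation: the names `TILE`, `tensor_primitive`, `load_tile`, and `tiled_matmul` are hypothetical, and the tile size of 2 is an arbitrary assumption chosen to keep the example small.

```python
# Hypothetical illustration only; this is NOT the paper's TSL or TAM interface.
TILE = 2  # assumed edge length of the fixed-shape tensor primitive


def tensor_primitive(a_tile, b_tile, acc):
    """Stand-in for a hardware tensor instruction: acc += a_tile @ b_tile."""
    for i in range(TILE):
        for j in range(TILE):
            for k in range(TILE):
                acc[i][j] += a_tile[i][k] * b_tile[k][j]


def load_tile(matrix, row, col):
    """Stand-in for copying one TILE x TILE block into scratchpad memory."""
    return [[matrix[row + i][col + j] for j in range(TILE)] for i in range(TILE)]


def tiled_matmul(a, b):
    """Multiply a (n x k) by b (k x p) using only TILE x TILE primitives."""
    n, k_dim, p = len(a), len(b), len(b[0])
    # The sketch assumes every dimension is a multiple of TILE (no padding logic).
    assert n % TILE == 0 and k_dim % TILE == 0 and p % TILE == 0
    c = [[0] * p for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, p, TILE):
            # The accumulator stays "on chip" across the whole reduction loop.
            acc = [[0] * TILE for _ in range(TILE)]
            for k0 in range(0, k_dim, TILE):
                tensor_primitive(load_tile(a, i0, k0), load_tile(b, k0, j0), acc)
            # Write the finished output block back to "off-chip" memory.
            for i in range(TILE):
                for j in range(TILE):
                    c[i0 + i][j0 + j] = acc[i][j]
    return c
```

In this sketch the loop order and tile size play the role of a schedule, while the arithmetic inside `tensor_primitive` plays the role of the computation description. Retargeting to a different machine would, under these assumptions, mean changing `TILE` and the two stand-in functions rather than the loop nest.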

Tinker: A Middleware for Deploying Multiple NN-Based Applications on a Single Machine

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

Work-in-Progress: Furion: Alleviating Overheads for Deep Learning Framework on Single Machine

SingleCaffe: an Efficient Framework for Deep Learning on a Single Node

T2C: A Multi-User System for Deploying DNNs in a Thing-to-Cloud Continuum

SchedTune: A Heterogeneity-Aware GPU Scheduler for Deep Learning

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism

Enabling One-size-fits-all Compilation Optimization across Machine Learning Computers for Inference

Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

Joint Architecture Design and Workload Partitioning for DNN Inference on Industrial IoT Clusters

Enabling One-Size-Fits-All Compilation Optimization for Inference Across Machine Learning Computers

Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

Multiple CNN-based Tasks Scheduling Across Shared GPU Platform in Research and Development Scenarios

Efficient Schedule Construction for Distributed Execution of Large DNN Models

CoTuner: A Hierarchical Learning Framework for Coordinately Optimizing Resource Partitioning and Parameter Tuning

A Pipelining-Based Heterogeneous Scheduling and Energy-Throughput Optimization Scheme for CNNs Leveraging Apache TVM

Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search

Automated Exploration and Implementation of Distributed CNN Inference at the Edge

EdgeCI: Distributed Workload Assignment and Model Partitioning for CNN Inference on Edge Clusters