Abstract: In recent years, memory-intensive operations have become dominant in the efficiency of running novel neural networks. Just-in-time operator fusion on accelerators such as GPUs has proven an effective method for optimizing memory-intensive operations, and it suits the many varying model structures. In particular, we find that memory-intensive operations on tensor views are ubiquitous in neural network implementations. Tensors are the de facto representation for numerical data in deep learning, while tensor views cover a range of sophisticated syntax that allows various interpretations of the underlying tensor data without memory copy. Supporting views in deep learning compilers could greatly enlarge the scope of operator fusion and benefit the optimization of novel neural networks. Nevertheless, mainstream solutions in state-of-the-art deep learning compilers exhibit imperfections in either view syntax representation or operator fusion. In this article, we propose EasyView, which enables and schedules tensor views in an end-to-end workflow from neural networks onto devices. Aiming to maximize memory utilization and reduce data movement, we categorize the various view contexts in the high-level language and lower views according to the different scenarios. The reference semantics of views are preserved in the lowering from native high-level language features to intermediate representations. Based on the preserved reference semantics, memory activities related to the data dependence of reads and writes are tracked for further compute and memory optimization. In addition, ample operator fusion is applied to memory-intensive operations with views. In our tests, the proposed work achieves average speedups of 5.63X, 2.44X, and 4.67X over XLA, JAX, and TorchScript, respectively, on hotspot Python functions. Moreover, operator fusion with views brings an 8.02% performance improvement in end-to-end neural networks.
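
As a rough illustration of what "interpretations of the underlying data without memory copy" and "reference semantics" mean for views, the sketch below uses Python's standard-library `memoryview` as a stand-in for a tensor view. This is an assumption made purely for illustration: it is not EasyView's representation or API, and the names `storage`, `base`, `strided`, and `grid` are hypothetical.

```python
from array import array

# Underlying storage: six contiguous doubles, standing in for tensor data.
storage = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
base = memoryview(storage)

# A strided view: elements 0, 2, 4 of the same buffer, with no copy made.
strided = base[::2]

# A reshaped view: the same bytes interpreted as a 2x3 grid.
# memoryview.cast needs a byte-format intermediate when changing shape.
grid = base.cast("B").cast("d", shape=[2, 3])

# Reference semantics: a write through one view is visible through the
# underlying storage and through every other view of it.
strided[1] = 20.0  # writes storage[2]

print(storage.tolist())
print(grid.tolist())
print(strided.tolist())
```

Because every view aliases one buffer, a compiler that fuses operators across views has to track which reads and writes touch the same memory, which is the kind of data dependence the abstract refers to.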

XFC: Enabling Automatic and Fast Operator Synthesis for Mobile Deep Learning Compilation

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

ATFormer: A Learned Performance Model with Transfer Learning Across Devices for Deep Learning Tensor Programs

FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers

TSCompiler: Efficient Compilation Framework for Dynamic-Shape Models

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Enabling One-Size-Fits-All Compilation Optimization for Inference Across Machine Learning Computers

Enabling One-size-fits-all Compilation Optimization across Machine Learning Computers for Inference

RAF: Holistic Compilation for Deep Learning Model Training

EasyView: Enabling and Scheduling Tensor Views in Deep Learning Compilers

ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

Automatic Deep Learning Operator Fusion on Sunway SW26010 Many-Core Processor

BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads Via Compiler Approach

ALT: Boosting Deep Learning Performance by Breaking the Wall between Graph and Operator Level Optimizations

HAOTuner: A Hardware Adaptive Operator Auto-Tuner for Dynamic Shape Tensor Compilers

AGO: Boosting Mobile AI Inference Performance by Removing Constraints on Graph Optimization

Syno: Structured Synthesis for Neural Operators

swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor

Compiler-Level Matrix Multiplication Optimization for Deep Learning

A High-Performance Dataflow-Centric Optimization Framework for Deep Learning Inference on the Edge