Abstract: Machine learning models built from various tensor operators have become ubiquitous in recent years. There are two types of operators in machine learning: compute-intensive operators (e.g., GEMM and convolution) and memory-intensive operators (e.g., ReLU and softmax). In emerging machine learning models, compute-intensive operators are usually organized in a chain structure. With the continual specialization of hardware, the gap between computing performance and memory bandwidth has become more prominent. Consequently, the implementations of many compute-intensive operator chains are bounded by memory bandwidth, and generating fused kernels to improve locality for these compute-intensive operators becomes necessary. However, existing machine learning compilers lack both precise analysis and efficient optimization for compute-intensive operator chains on different accelerators. As a result, they usually produce sub-optimal performance for these operator chains.

In this paper, we propose Chimera, an optimizing framework that can efficiently improve the locality of compute-intensive operator chains on different hardware accelerators. In Chimera, each compute-intensive operator is composed of a series of computation blocks. To generate efficient fused kernels for the operator chains, both inter-block and intra-block optimizations are required. For inter-block optimization, Chimera decides the optimized block execution order by minimizing the data movement volume among blocks using an analytical model. For intra-block optimization, Chimera uses unified replaceable micro kernels to apply hardware-specific optimizations for different accelerators. Finally, Chimera generates fused kernels for compute-intensive operator chains. Evaluation of batch GEMM chains and convolution chains on CPU, GPU, and NPU shows that Chimera achieves speedups of up to 2.87×, 2.29×, and 2.39×, respectively, over hand-tuned libraries. Compared to state-of-the-art compilers, the speedups are up to 2.29×, 1.64×, and 1.14× on CPU, GPU, and NPU, respectively.
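
To make the inter-block idea concrete, the following Python sketch enumerates block execution orders for a fused GEMM chain, C = A × B followed by E = C × D, and picks the order with the smallest estimated data movement. It is a toy illustration and not Chimera's actual analytical model. The function names, tile sizes, and block counts are hypothetical. The cost rule is a common simplification: one resident tile per tensor, a perfectly nested block loop, the intermediate C kept on-chip, and no capacity or recomputation effects.

```python
from itertools import permutations
from math import prod


def data_movement(order, trips, tensors):
    """Estimate elements moved between off-chip and on-chip memory.

    order   : tuple of block-loop names, outermost first
    trips   : dict mapping loop name -> number of blocks along that loop
    tensors : dict mapping tensor name -> (tile_elems, set of loops indexing it)

    Assumed rule (a simplification, not the paper's model): a tensor's tile is
    reloaded once per iteration of its innermost indexing loop and of every
    loop outside it; loops nested deeper reuse the resident tile.
    """
    total = 0
    for tile_elems, loops in tensors.values():
        innermost = max(order.index(name) for name in loops)
        total += tile_elems * prod(trips[name] for name in order[: innermost + 1])
    return total


def best_order(trips, tensors):
    """Exhaustively search all block orders and return the cheapest one."""
    return min(permutations(trips), key=lambda o: data_movement(o, trips, tensors))


# Hypothetical problem: block counts per loop and 16x16 tiles (256 elements).
trips = {"m": 4, "n": 2, "k": 2, "l": 4}
tensors = {
    "A": (256, {"m", "k"}),  # input of the first GEMM
    "B": (256, {"k", "l"}),  # input of the first GEMM
    "D": (256, {"l", "n"}),  # input of the second GEMM
    "E": (256, {"m", "n"}),  # output of the second GEMM
    # The intermediate C[m, l] is assumed to stay on-chip, so it is omitted.
}

best = best_order(trips, tensors)
best_cost = data_movement(best, trips, tensors)
naive_cost = data_movement(("m", "l", "k", "n"), trips, tensors)
print(best, best_cost, naive_cost)
```

Under these assumptions the order (m, l, k, n) moves 49152 elements, while the best order found by the search moves 38912. Exhaustive search is only practical here because a four-loop chain has 24 permutations.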

Chimera: an Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion
