Abstract: Machine learning models built from various tensor operators have become ubiquitous in recent years. There are two types of operators in machine learning: compute-intensive operators (e.g., GEMM and convolution) and memory-intensive operators (e.g., ReLU and softmax). In emerging machine learning models, compute-intensive operators are usually organized in a chain structure. With the continual specialization of hardware, the gap between computing performance and memory bandwidth has become more prominent. Consequently, the implementations of many compute-intensive operator chains are bounded by memory bandwidth, and generating fused kernels to improve locality for these compute-intensive operators becomes necessary. However, existing machine learning compilers lack both precise analysis and efficient optimization for compute-intensive operator chains on different accelerators. As a result, they usually produce sub-optimal performance for these operator chains.

In this paper, we propose Chimera, an optimizing framework that can efficiently improve the locality of compute-intensive operator chains on different hardware accelerators. In Chimera, each compute-intensive operator is composed of a series of computation blocks. To generate efficient fused kernels for the operator chains, both inter-block and intra-block optimizations are required. For inter-block optimization, Chimera decides the optimized block execution order by minimizing the data movement volume among blocks using an analytical model. For intra-block optimization, Chimera uses unified replaceable micro kernels to apply hardware-specific optimizations for different accelerators. Finally, Chimera generates fused kernels for compute-intensive operator chains. Evaluation of batch GEMM chains and convolution chains on CPU, GPU, and NPU shows that Chimera achieves speedups of up to 2.87×, 2.29×, and 2.39×, respectively, over hand-tuned libraries. Compared to state-of-the-art compilers, the speedups are up to 2.29×, 1.64×, and 1.14× for CPU, GPU, and NPU, respectively.
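
To make the idea of choosing a block execution order by minimizing data movement more concrete, the following is a minimal sketch in Python. It is not Chimera's actual analytical model: it covers only a single tiled GEMM rather than an operator chain, it assumes that exactly one tile per tensor is held on chip, and it counts each tile transfer once. The names `TENSOR_DIMS`, `data_movement`, and `best_order` are hypothetical and introduced here for illustration only.

```python
from itertools import permutations
from math import prod

# Which block loops index each tensor in C[m, n] += A[m, k] * B[k, n].
TENSOR_DIMS = {"A": ("m", "k"), "B": ("k", "n"), "C": ("m", "n")}


def data_movement(order, sizes, tiles):
    """Estimate elements moved between off-chip and on-chip memory.

    order: block loops from outermost to innermost, e.g. ("m", "n", "k").
    sizes: full extent of each dimension; tiles: block extent of each
    dimension (assumed to divide the full extent evenly).
    """
    trips = {d: sizes[d] // tiles[d] for d in sizes}  # blocks per dimension
    total = 0
    for dims in TENSOR_DIMS.values():
        tile_elems = prod(tiles[d] for d in dims)
        # Loops nested inside the innermost loop that indexes this tensor
        # leave its tile unchanged, so the tile is reused, not reloaded.
        last = max(order.index(d) for d in dims)
        loads = prod(trips[d] for d in order[: last + 1])
        total += tile_elems * loads
    return total


def best_order(sizes, tiles):
    """Enumerate every block loop order and return the cheapest one."""
    return min(permutations(sizes), key=lambda o: data_movement(o, sizes, tiles))


sizes = {"m": 1024, "n": 1024, "k": 64}
tiles = {"m": 128, "n": 128, "k": 64}
order = best_order(sizes, tiles)
print(order, data_movement(order, sizes, tiles))
```

Under these assumptions, the order with `k` innermost reloads the tiles of both `A` and `B` on every block iteration, while several other orders keep one input tile resident across the innermost loop and move less data. A model for operator chains would additionally have to account for the intermediate tensor that a fused kernel keeps on chip.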

Optimal Kernel Orchestration for Tensor Programs with Korch

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning

Klotski: DNN Model Orchestration Framework for Dataflow Architecture Accelerators

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Kernel Operations on the GPU, with Autodiff, without Memory Overflows

CORF: Bridging the Gap of Complex Operator Fusion for Faster DNN Inference

Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent

Chimera: an Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion

GTCO: Graph and Tensor Co-Design for Transformer-Based Image Recognition on Tensor Cores

Ansor: Generating High-Performance Tensor Programs for Deep Learning

ETO: Accelerating Optimization of DNN Operators by High-Performance Tensor Program Reuse

Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs

PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives

Software for Sparse Tensor Decomposition on Emerging Computing Architectures

Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization

ThunderKittens: Simple, Fast, and Adorable AI Kernels

TorchOpt: an Efficient Library for Differentiable Optimization

Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance

KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch Parameters Targeting CUDA Programs

DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion