Abstract: Deep Learning Accelerators (DLAs) are effective at improving both the performance and energy efficiency of compute-intensive deep learning algorithms. A flexible and portable means of exploiting DLAs is to use high-performance software libraries with well-established APIs, which are typically either manually implemented or automatically generated by exploration-based compilation approaches. Although exploration-based approaches significantly reduce programming effort, they fail to find optimal or near-optimal programs because their search space is large but low-quality: the massive inherent constraints of DLAs cannot be accurately characterized. In this paper, we propose Heron, a novel exploration-based approach that efficiently generates high-performance libraries for DLAs. The key is to automatically (rather than manually) enforce massive, sophisticated, yet accurate constraints throughout the entire program generation process, including constrained space generation and constrained space exploration. By conducting static analysis on the computation, Heron automatically generates sophisticated constraints that properly characterize the inherent constraints of DLAs, thereby pruning invalid program candidates and producing a high-quality constrained search space. To efficiently explore the resulting search space, we further propose a novel constraint-based genetic algorithm, in which the evolutionary process is conducted on formulated constraint satisfaction problems instead of concrete solutions. Thus, the sophisticated constraints of the search space are strictly preserved during the entire exploration process. We conduct extensive experiments on three representative DLAs: NVIDIA TensorCore, Intel DL Boost Acceleration, and TVM Versatile Tensor Accelerator. Experimental results demonstrate that Heron achieves an average speedup of 2.71x over four state-of-the-art automatic generation approaches. Also, compared to vendor-provided hand-tuned libraries, Heron achieves a 2.00x speedup on average.

Ansor: Generating High-Performance Tensor Programs for Deep Learning

Enabling Tensor Language Model to Assist in Generating High-Performance Tensor Programs for Deep Learning

ATFormer: A Learned Performance Model with Transfer Learning Across Devices for Deep Learning Tensor Programs

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

ETO: Accelerating Optimization of DNN Operators by High-Performance Tensor Program Reuse

EINNET: Optimizing Tensor Programs with Derivation-Based Transformations

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators

OLLIE: Derivation-based Tensor Program Optimizer

High-Performance Generalized Tensor Operations

Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator

Bring Your Own Codegen to Deep Learning Compiler

TensorIR: an Abstraction for Automatic Tensorized Program Optimization

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation

Efficient and Fast High-performance Library Generation for Deep Learning Accelerators

ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

HAOTuner: A Hardware Adaptive Operator Auto-Tuner for Dynamic Shape Tensor Compilers

Machine Learning-enabled Performance Model for DNN Applications and AI Accelerator

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures

Automatic Generation of Spatial Accelerator for Tensor Algebra