Abstract: Deep learning compilers with auto-tuners can generate high-performance programs, particularly tensor programs on accelerators. However, the performance of these tensor programs is sensitive to both tensor shape and hardware resources. When a tensor shape is known only at runtime rather than at compile time, auto-tuners must tune the tensor program for every possible shape, which incurs significant time and cost overhead. Additionally, a tensor program tuned for one device may perform suboptimally when deployed on a different device. To address these challenges, we propose HAOTuner, a hardware-adaptive deep learning operator auto-tuner designed specifically for dynamic shape tensors. We use micro-kernels as the unit of task allocation and observe that micro-kernel size greatly impacts performance. In HAOTuner, we determine the size of micro-kernels based not only on the tensor shapes but also on the available hardware resources. Specifically, we present an algorithm that selects hardware-friendly micro-kernels as candidates, reducing the tuning time. We also design a cost model that is sensitive to hardware resources in order to support various hardware architectures. Furthermore, we provide a model transfer solution that enables fast deployment of the cost model on different hardware platforms. We evaluate HAOTuner on six different types of GPUs. The experiments demonstrate that HAOTuner surpasses the state-of-the-art dynamic shape tensor auto-tuner by an average of 26% in running time and 25% in tuning time. Moreover, HAOTuner outperforms the state-of-the-art compiler with padding by an average of 39% in running time and 6× in tuning time.
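
The idea of choosing micro-kernel sizes from both the tensor shape and the hardware resources can be illustrated with a minimal sketch. This is not HAOTuner's algorithm: the `Device` fields, the tile-size list, the shared-memory footprint formula, and the scoring rule are all assumptions made for illustration. The sketch first filters tile shapes that fit a shared-memory budget, then picks the one that wastes the least work on padding for a shape seen at runtime.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Device:
    # Hypothetical resource description, not HAOTuner's actual data model.
    shared_mem_bytes: int
    num_sms: int


def candidate_micro_kernels(device, tile_sizes=(16, 32, 64, 128),
                            dtype_bytes=4, k_step=16):
    """Keep (tm, tn) tiles whose A and B sub-tiles fit in shared memory.

    The footprint formula is an assumed simplification for a matmul tile.
    """
    candidates = []
    for tm in tile_sizes:
        for tn in tile_sizes:
            footprint = (tm + tn) * k_step * dtype_bytes
            if footprint <= device.shared_mem_bytes:
                candidates.append((tm, tn))
    return candidates


def padding_waste(m, n, tm, tn):
    """Fraction of computed output elements that are padding for an m x n output."""
    padded = ceil(m / tm) * tm * ceil(n / tn) * tn
    return 1.0 - (m * n) / padded


def pick_micro_kernel(m, n, device):
    """Pick a candidate tile for a runtime shape (m, n).

    Assumed ranking: avoid leaving SMs idle, then minimize padding waste,
    then prefer larger tiles.
    """
    def score(tile):
        tm, tn = tile
        tiles = ceil(m / tm) * ceil(n / tn)
        underfilled = tiles < device.num_sms
        return (underfilled, padding_waste(m, n, tm, tn), -(tm * tn))

    return min(candidate_micro_kernels(device), key=score)
```

Calling `pick_micro_kernel(m, n, device)` with a different `Device` can return a different tile for the same shape, which is the shape-and-hardware sensitivity the abstract describes. A real tuner would replace the hand-written score with a learned cost model.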

swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor

Automatic Deep Learning Operator Fusion on Sunway SW26010 Many-Core Processor

Swdnn: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight

DaDianNao: A Machine-Learning Supercomputer

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

Optimizing Convolutional Neural Networks on the Sunway TaihuLight Supercomputer

swTVM: Towards Optimized Tensor Code Generation for Deep Learning on Sunway Many-Core Processor

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Swtensor: Accelerating Tensor Decomposition on Sunway Architecture

swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

Automatic Optimization of Parallel Parameters for Sunway TaihuLight Super-computer Application Program

Taming the "Monster": Overcoming Program Optimization Challenges on SW26010 Through Precise Performance Modeling

A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010

Scalable Deep-Learning-Accelerated Topology Optimization for Additively Manufactured Materials

HAOTuner: A Hardware Adaptive Operator Auto-Tuner for Dynamic Shape Tensor Compilers

Parallel Tridiagonal Solver on Sunway Many-Core Processors

ATFormer: A Learned Performance Model with Transfer Learning Across Devices for Deep Learning Tensor Programs

Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

swFLOW: A large-scale distributed framework for deep learning on Sunway TaihuLight supercomputer

Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor