Abstract: Deep Learning Accelerators (DLAs) are effective at improving both the performance and the energy efficiency of compute-intensive deep learning algorithms. A flexible and portable means of exploiting DLAs is to use high-performance software libraries with well-established APIs, which are typically either manually implemented or automatically generated by exploration-based compilation approaches. Although exploration-based approaches significantly reduce programming effort, they fail to find optimal or near-optimal programs because they search a large but low-quality space: the many inherent constraints of DLAs cannot be accurately characterized. In this paper, we propose Heron, a novel exploration-based approach that efficiently generates high-performance libraries for DLAs. The key is to automatically (rather than manually) enforce a large number of sophisticated yet accurate constraints throughout program generation, covering both constrained space generation and constrained space exploration. By conducting static analysis on compute, Heron automatically generates sophisticated constraints that properly characterize the inherent constraints of DLAs, thereby pruning invalid program candidates and producing a high-quality constrained search space. To efficiently explore the resulting search space, we further propose a novel constraint-based genetic algorithm, in which the evolutionary process is conducted on formulated constraint satisfaction problems instead of concrete solutions. Thus, the sophisticated constraints of the search space are strictly preserved during the entire exploration process. We conduct extensive experiments on three representative DLAs: NVIDIA TensorCore, Intel DL Boost Acceleration, and TVM Versatile Tensor Accelerator. Experimental results demonstrate that Heron achieves an average speedup of 2.71x over four state-of-the-art automatic generation approaches. Also, compared to vendor-provided hand-tuned libraries, Heron achieves a 2.00x speedup on average.
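
To make the idea of evolving constraint satisfaction problems (rather than concrete solutions) more tangible, the following is a minimal toy sketch in Python. It is not Heron's implementation: the tile-size variables, the buffer capacity, the alignment rule, the cost function, and all function names are assumptions chosen for illustration only. Crossover and mutation edit a set of extra "pin" constraints, and a brute-force solver then instantiates a concrete candidate, so every candidate satisfies the base constraints by construction.

```python
import itertools
import random

# Hypothetical toy problem: choose tile sizes (tm, tn, tk) for a matmul.
DOMAINS = {
    "tm": [1, 2, 4, 8, 16, 32, 64],
    "tn": [1, 2, 4, 8, 16, 32, 64],
    "tk": [1, 2, 4, 8, 16, 32, 64],
}
CAPACITY = 1024  # assumed on-chip buffer size, in elements


def base_ok(c):
    """Assumed hardware-style constraints that every candidate must satisfy."""
    fits = c["tm"] * c["tk"] + c["tk"] * c["tn"] <= CAPACITY
    aligned = c["tm"] % 4 == 0 and c["tn"] % 4 == 0
    return fits and aligned


def solve(pins, rng):
    """Return a random assignment satisfying base constraints plus pins, or None."""
    names = list(DOMAINS)
    choices = [[pins[n]] if n in pins else DOMAINS[n] for n in names]
    feasible = [dict(zip(names, vals)) for vals in itertools.product(*choices)]
    feasible = [c for c in feasible if base_ok(c)]
    return rng.choice(feasible) if feasible else None


def cost(c):
    """Stand-in cost model (assumed): lower is better; favors large tiles with tm near tn."""
    return -(c["tm"] * c["tn"] * c["tk"]) + abs(c["tm"] - c["tn"])


def crossover(sol_a, sol_b, rng):
    """Build a child CSP by pinning each variable to one parent's value."""
    return {n: rng.choice([sol_a[n], sol_b[n]]) for n in DOMAINS}


def mutate(pins, rng):
    """Relax the child CSP by dropping one pin, so the solver may pick a new value."""
    pins = dict(pins)
    pins.pop(rng.choice(list(pins)))
    return pins


def evolve(generations=10, pop_size=8, seed=0):
    rng = random.Random(seed)
    # The unpinned CSP is feasible, so the initial population is always valid.
    population = [solve({}, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            pins = mutate(crossover(a, b, rng), rng)
            child = solve(pins, rng)
            while child is None:  # mixed pins can conflict; relax further
                pins = mutate(pins, rng)
                child = solve(pins, rng)
            children.append(child)
        population = parents + children
    return min(population, key=cost), population


if __name__ == "__main__":
    best, _ = evolve()
    print(best, cost(best))
```

In this sketch the genetic operators never produce an invalid candidate directly: an infeasible combination of pins is relaxed until the solver finds an assignment, which mirrors the abstract's claim that constraints are preserved throughout exploration. A real system would replace the brute-force enumeration with a proper constraint solver and the stand-in cost with measured or modeled performance.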

Efficient and Fast High-performance Library Generation for Deep Learning Accelerators

Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator

HybridDNN: A Framework for High-Performance Hybrid DNN Accelerator Design and Implementation

Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

Being-ahead: Benchmarking and Exploring Accelerators for Hardware-Efficient AI Deployment

Software-defined Design Space Exploration for an Efficient DNN Accelerator Architecture

Ansor: Generating High-Performance Tensor Programs for Deep Learning

O-HAS: Optical Hardware Accelerator Search for Boosting Both Acceleration Performance and Development Speed

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip

DGNN-Booster: A Generic FPGA Accelerator Framework For Dynamic Graph Neural Network Inference

Automatic Generation of Spatial Accelerator for Tensor Algebra

HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator

NASA: Neural Architecture Search and Acceleration for Hardware Inspired Hybrid Networks

TensorLib - A Spatial Accelerator Generation Framework for Tensor Algebra

A Small-Footprint Accelerator for Large-Scale Neural Networks

DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator

Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations

A Low-Power Accelerator for Deep Neural Networks with Enlarged Near-Zero Sparsity

Apollo: Transferable Architecture Exploration