Abstract: Deep neural networks (DNNs) are increasingly deployed in image recognition and natural language processing applications. The continuous demand for accuracy and high performance has led to innovations in DNN design and a proliferation of new operators. However, existing DNN training frameworks such as PyTorch and TensorFlow support only a limited range of operators and rely on hand-optimized libraries to implement them efficiently. To evaluate novel neural networks with new operators, programmers must either express each new operator in terms of existing operators or provide low-level implementations manually. A critical requirement for DNN training frameworks is therefore to automatically provide high-performance implementations for neural networks containing new operators when efficient library support is absent. In this article, we introduce NeoFlow, a flexible framework that enables efficient compilation for high-performance DNN training. NeoFlow allows programmers to write customized expressions directly as new operators, which are then mapped to a graph representation and low-level implementations automatically, providing both high programming productivity and high performance. First, NeoFlow provides expression-based automatic differentiation to support customized model definitions with new operators. Second, NeoFlow proposes an efficient compilation system that partitions the neural network graph into subgraphs, explores optimized schedules, and automatically generates high-performance libraries for the subgraphs. Finally, NeoFlow develops an efficient runtime system that combines compilation and training by overlapping their execution. In the experiments, we examine the numerical accuracy and performance of NeoFlow. The results show that NeoFlow achieves similar or better performance than deep learning frameworks at both the operator level and the whole-graph level. In particular, for training novel networks, the geometric mean speedups of NeoFlow over PyTorch, TensorFlow, and CuDNN are 3.16×, 2.43×, and 1.92×, respectively.
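To make the idea of expression-based automatic differentiation concrete, the following minimal sketch writes an operator as an index expression and writes its gradient as a second index expression. It is not NeoFlow's API or algorithm: every function name here is hypothetical, the gradient expression is derived by hand rather than generated, and matrix multiplication stands in for a "new operator". The sketch only checks that the gradient expression agrees with a finite-difference estimate.

```python
# Hypothetical illustration; none of these names come from NeoFlow.

def matmul(A, B):
    """Forward expression: C[i][j] = sum_k A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def matmul_grad_A(dC, B):
    """Gradient expression: dA[i][k] = sum_j dC[i][j] * B[k][j]."""
    n, m, p = len(dC), len(B), len(B[0])
    return [[sum(dC[i][j] * B[k][j] for j in range(p)) for k in range(m)]
            for i in range(n)]

def loss(A, B):
    """Scalar loss: the sum of all entries of C, so dC is all ones."""
    return sum(sum(row) for row in matmul(A, B))

def numeric_grad_A(A, B, eps=1e-6):
    """Central finite differences of the loss with respect to each A[i][k]."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for k in range(len(A[0])):
            plus = [row[:] for row in A]
            minus = [row[:] for row in A]
            plus[i][k] += eps
            minus[i][k] -= eps
            grad[i][k] = (loss(plus, B) - loss(minus, B)) / (2 * eps)
    return grad

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[1.0, 0.5], [-1.0, 2.0], [0.0, 3.0]]
dC = [[1.0, 1.0], [1.0, 1.0]]  # gradient of the sum-of-entries loss w.r.t. C

analytic = matmul_grad_A(dC, B)
numeric = numeric_grad_A(A, B)
max_err = max(abs(a - n)
              for row_a, row_n in zip(analytic, numeric)
              for a, n in zip(row_a, row_n))
```

Because the gradient is itself an ordinary index expression, it could in principle be scheduled and compiled in the same way as the forward operator, which is the property the abstract relies on.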

Compiler-assisted Operator Template Library for DNN Accelerators

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation

Optimus: An Operator Fusion Framework for Deep Neural Networks

LLM-Aided Compilation for Tensor Accelerators

LoopStack: a Lightweight Tensor Algebra Compiler Stack

swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

An Automated Compiler for RISC-V Based DNN Accelerator

ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

UNIT: Unifying Tensorized Instruction Compilation

OLLIE: Derivation-based Tensor Program Optimizer

TSCompiler: Efficient Compilation Framework for Dynamic-Shape Models

ATFormer: A Learned Performance Model with Transfer Learning Across Devices for Deep Learning Tensor Programs

Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator

XFC: Enabling Automatic and Fast Operator Synthesis for Mobile Deep Learning Compilation

NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training

HAOTuner: A Hardware Adaptive Operator Auto-Tuner for Dynamic Shape Tensor Compilers

Automatic generation of CUDA code performing tensor manipulations using C++ expression templates

ALT: Boosting Deep Learning Performance by Breaking the Wall between Graph and Operator Level Optimizations

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10

GTCO: Graph and Tensor Co-Design for Transformer-Based Image Recognition on Tensor Cores