Abstract: Deep neural networks (DNNs) are increasingly deployed in image recognition and natural language processing applications. The continuous demand for accuracy and high performance has led to innovations in DNN design and a proliferation of new operators. However, existing DNN training frameworks such as PyTorch and TensorFlow support only a limited range of operators and rely on hand-optimized libraries to implement them efficiently. To evaluate novel neural networks with new operators, programmers must either re-express each new operator in terms of existing operators or provide low-level implementations manually. A critical requirement for DNN training frameworks is therefore to automatically provide high-performance implementations for neural networks containing new operators when efficient library support is absent. In this article, we introduce NeoFlow, a flexible framework for enabling efficient compilation for high-performance DNN training. NeoFlow lets programmers write customized expressions directly as new operators, which are then mapped to a graph representation and low-level implementations automatically, providing both high programming productivity and high performance. First, NeoFlow provides expression-based automatic differentiation to support customized model definitions with new operators. Second, NeoFlow proposes an efficient compilation system that partitions the neural network graph into subgraphs, explores optimized schedules, and automatically generates high-performance libraries for the subgraphs. Finally, NeoFlow develops an efficient runtime system that integrates compilation and training by overlapping their execution. In the experiments, we examine the numerical accuracy and performance of NeoFlow. The results show that NeoFlow achieves similar or better performance than deep learning frameworks at both the operator level and the whole-graph level. In particular, for training novel networks, the geometric mean speedups of NeoFlow over PyTorch, TensorFlow, and cuDNN are 3.16×, 2.43×, and 1.92×, respectively.
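
To make the idea of expression-based automatic differentiation concrete, the following is a minimal sketch in plain Python. It is not NeoFlow's implementation or API: the class names (`Expr`, `Var`, `Const`, `Add`, `Mul`) and the scalar setting are assumptions chosen for illustration, whereas the framework described above works on tensor expressions and compiles them to optimized kernels. The sketch only shows the general principle that a gradient expression can be derived mechanically from a user-written forward expression, so no hand-written backward pass is needed for a new operator.

```python
from dataclasses import dataclass


class Expr:
    """Base class for a tiny scalar expression language (hypothetical, for illustration)."""

    def __add__(self, other):
        return Add(self, other)

    def __mul__(self, other):
        return Mul(self, other)


@dataclass(frozen=True)
class Var(Expr):
    name: str

    def eval(self, env):
        return env[self.name]

    def grad(self, wrt):
        # d(var)/d(wrt) is 1 for the same variable, 0 otherwise.
        return Const(1.0 if self.name == wrt else 0.0)


@dataclass(frozen=True)
class Const(Expr):
    value: float

    def eval(self, env):
        return self.value

    def grad(self, wrt):
        return Const(0.0)


@dataclass(frozen=True)
class Add(Expr):
    a: Expr
    b: Expr

    def eval(self, env):
        return self.a.eval(env) + self.b.eval(env)

    def grad(self, wrt):
        # Sum rule: (a + b)' = a' + b'
        return Add(self.a.grad(wrt), self.b.grad(wrt))


@dataclass(frozen=True)
class Mul(Expr):
    a: Expr
    b: Expr

    def eval(self, env):
        return self.a.eval(env) * self.b.eval(env)

    def grad(self, wrt):
        # Product rule: (a * b)' = a' * b + a * b'
        return Add(Mul(self.a.grad(wrt), self.b), Mul(self.a, self.b.grad(wrt)))


# A "new operator" written directly as an expression: loss = (w * x + b)^2
w, x, b = Var("w"), Var("x"), Var("b")
y = w * x + b
loss = y * y

env = {"w": 2.0, "x": 3.0, "b": 1.0}
loss_value = loss.eval(env)            # forward pass
dloss_dw = loss.grad("w").eval(env)    # gradient expression derived from the forward expression
dloss_db = loss.grad("b").eval(env)
```

In this sketch, `loss.grad("w")` returns another expression rather than a number, which is the property that would let a compiler optimize forward and backward computations in the same way.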

Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization

DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion

OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning

Optimus: An Operator Fusion Framework for Deep Neural Networks

Automatic Deep Learning Operator Fusion on Sunway SW26010 Many-Core Processor

ParallelFusion

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

Chimera: an Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

DNNVM - End-to-End Compiler Leveraging Operation Fusion on FPGA-based CNN Accelerators

Multi-Modal Image Fusion Via Deep Laplacian Pyramid Hybrid Network

PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives

Work-in-Progress: Furion: Alleviating Overheads for Deep Learning Framework on Single Machine

Fusionmlp: A Mlp-Based Unified Image Fusion Framework

TileFlow: A Framework for Modeling Fusion Dataflow Via Tree-based Analysis

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference