Abstract: Network sparsification is an effective technique for accelerating Deep Neural Network (DNN) inference. However, existing sparsification techniques often rely on structured sparsity, which yields limited benefits, primarily because the many sparse storage formats in use introduce significant memory and computational overhead during address generation and gradient updates. In addition, many of these solutions target only the inference phase and neglect the training phase. In this paper, we introduce STCO, a Sparse Tensor Compilation Optimization technique that improves training efficiency through structured sparse tensor compilation. Central to STCO is the Tensorization-aware Index Entity (TIE) format, which represents structured sparse tensors compactly by eliminating redundant indices and minimizing storage overhead. The TIE format underpins the Address-carry flow (AC flow) pass, which optimizes the data layout at the computational graph level and uses TIE to make sparse tensor storage more compact. A shape inference pass then uses the AC flow to derive optimized tensor shapes, further improving the performance of sparse tensor operations. The Address-Carry TIE Flow also tracks nonzero addresses dynamically, extending the benefits of sparse optimization to both forward and backward propagation. As a result, STCO integrates into the training pipeline and enables a transition to sparse tensor compilation without significant modifications to existing codebases. To further boost training performance, we implement an operator-level AC flow optimization pass tailored for structured sparse tensors, which generates addresses efficiently and keeps the computational overhead of sparse tensor operations low. STCO can be integrated into various frameworks and compilers. Experiments show that STCO achieves speedups of 3.64×, 5.43×, 4.89×, and 3.91× over state-of-the-art sparse formats on VGG16, ResNet-18, MobileNetV1, and MobileNetV2, respectively. These results demonstrate the efficiency of the proposed approach in leveraging structured sparsity to accelerate DNN training.
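The abstract does not describe the TIE layout itself, so the following is only an illustrative sketch of the general idea it mentions: under structured (here, block) sparsity, one index per nonzero block can replace one index per nonzero element. The function names, the square block shape, and the comparison against a COO-style per-element index are all assumptions made for this example and are not taken from STCO.

```python
import numpy as np

def to_block_sparse(dense, block):
    """Keep only nonzero (block x block) tiles, with one (row, col) index per tile.

    Hypothetical helper for illustration; this is not the TIE format.
    """
    rows, cols = dense.shape
    assert rows % block == 0 and cols % block == 0, "shape must be a multiple of block"
    indices, tiles = [], []
    for br in range(rows // block):
        for bc in range(cols // block):
            tile = dense[br * block:(br + 1) * block, bc * block:(bc + 1) * block]
            if np.any(tile):
                indices.append((br, bc))
                tiles.append(tile.copy())
    return indices, tiles

def from_block_sparse(indices, tiles, shape, block):
    """Rebuild the dense matrix from the stored tiles and their block indices."""
    dtype = tiles[0].dtype if tiles else np.float32
    dense = np.zeros(shape, dtype=dtype)
    for (br, bc), tile in zip(indices, tiles):
        dense[br * block:(br + 1) * block, bc * block:(bc + 1) * block] = tile
    return dense

def coo_index_count(dense):
    """Index entries a per-element (row, col) format would store."""
    return 2 * int(np.count_nonzero(dense))

def block_index_count(indices):
    """Index entries the block format stores: one (row, col) pair per tile."""
    return 2 * len(indices)

# Small demo: an 8x8 matrix with two dense 4x4 blocks.
dense = np.zeros((8, 8), dtype=np.float32)
dense[0:4, 4:8] = 1.0
dense[4:8, 0:4] = 2.0

indices, tiles = to_block_sparse(dense, block=4)
restored = from_block_sparse(indices, tiles, dense.shape, block=4)

print("block indices:", indices)
print("per-element index entries:", coo_index_count(dense))
print("per-block index entries:", block_index_count(indices))
```

In this toy case the per-element scheme stores 64 index entries for 32 nonzeros, while the block scheme stores 4, which is the kind of redundancy a structure-aware format can remove. How TIE actually encodes indices, and how the AC flow pass carries addresses through the graph, is not specified in the abstract and is not modeled here.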

STCO: Enhancing Training Efficiency Via Structured Sparse Tensor Compilation Optimization
