Abstract: Network sparsification is an effective technique for accelerating Deep Neural Network (DNN) inference. However, existing sparsification techniques often rely on structured sparsity, which yields limited benefits. This is primarily due to the significant memory and computational overhead that many sparse storage formats introduce during address generation and gradient updates. Additionally, many of these solutions target only the inference phase and neglect the training phase. In this paper, we introduce STCO, a novel Sparse Tensor Compilation Optimization technique that significantly improves training efficiency through structured sparse tensor compilation. Central to STCO is the Tensorization-aware Index Entity (TIE) format, which represents structured sparse tensors compactly by eliminating redundant indices and thereby minimizing storage overhead. The TIE format underpins the Address-Carry flow (AC flow) pass, which optimizes the data layout at the computational graph level and enables more compact and efficient sparse tensor storage. Meanwhile, a shape inference pass uses the AC flow to derive optimized tensor shapes, further improving the performance of sparse tensor operations. Moreover, the Address-Carry TIE flow dynamically tracks nonzero addresses, extending the benefits of sparse optimization to both forward and backward propagation. Because it integrates into the training pipeline, STCO enables a transition to sparse tensor compilation without significant modifications to existing codebases. To further boost training performance, we implement an operator-level AC flow optimization pass tailored to structured sparse tensors, which generates addresses efficiently and keeps computational overhead minimal during sparse tensor operations.
The flexibility of STCO allows it to be integrated into various frameworks or compilers, providing a robust solution for improving training efficiency with structured sparse tensors. Experiments demonstrate that STCO achieves speedups of 3.64×, 5.43×, 4.89×, and 3.91× over state-of-the-art sparse formats on VGG16, ResNet-18, MobileNetV1, and MobileNetV2, respectively. These results underscore the efficiency of our approach in leveraging structured sparsity to accelerate DNN training.
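The abstract does not specify the TIE layout. As a rough, hypothetical illustration of why block-granular indexing removes redundant indices, the sketch below stores one coordinate pair per nonzero block (in the spirit of block-sparse row formats) instead of one per nonzero element, regenerates element addresses on the fly, and reuses the same block coordinates in both the forward and the backward pass. `BlockSparse`, `addresses`, `forward`, `backward`, and `coo_index_count` are illustrative names assumed for this sketch, not STCO's API.

```python
from dataclasses import dataclass


@dataclass
class BlockSparse:
    """Hypothetical block-granular sparse matrix (not the actual TIE format):
    one (block_row, block_col) pair per nonzero block, not per element."""
    shape: tuple   # (rows, cols) of the equivalent dense matrix
    block: tuple   # (bh, bw) block size; must divide shape evenly
    coords: list   # [(block_row, block_col), ...] for each stored block
    values: list   # one bh x bw list-of-lists per entry in coords

    @classmethod
    def from_dense(cls, dense, block):
        bh, bw = block
        rows, cols = len(dense), len(dense[0])
        assert rows % bh == 0 and cols % bw == 0, "block must tile the matrix"
        coords, values = [], []
        for br in range(rows // bh):
            for bc in range(cols // bw):
                tile = [row[bc * bw:(bc + 1) * bw] for row in dense[br * bh:(br + 1) * bh]]
                # Keep the whole block if any element in it is nonzero.
                if any(v != 0 for tile_row in tile for v in tile_row):
                    coords.append((br, bc))
                    values.append(tile)
        return cls((rows, cols), block, coords, values)

    def addresses(self, k):
        """Generate (i, j, row, col) for block k from its single stored coordinate."""
        bh, bw = self.block
        br, bc = self.coords[k]
        for i in range(bh):
            for j in range(bw):
                yield i, j, br * bh + i, bc * bw + j

    def index_count(self):
        # Two integers (block row, block col) per stored block.
        return 2 * len(self.coords)


def coo_index_count(dense):
    """Index integers a per-element (row, col) COO format would store."""
    return 2 * sum(1 for row in dense for v in row if v != 0)


def forward(w, x):
    """y = W x, touching only the stored blocks."""
    y = [0.0] * w.shape[0]
    for k in range(len(w.coords)):
        for i, j, r, c in w.addresses(k):
            y[r] += w.values[k][i][j] * x[c]
    return y


def backward(w, x, dy):
    """Gradients for y = W x, reusing the forward block coordinates so that
    dW is produced only for stored blocks."""
    bh, bw = w.block
    dw = [[[0.0] * bw for _ in range(bh)] for _ in w.coords]
    dx = [0.0] * w.shape[1]
    for k in range(len(w.coords)):
        for i, j, r, c in w.addresses(k):
            dw[k][i][j] += dy[r] * x[c]
            dx[c] += w.values[k][i][j] * dy[r]
    return dw, dx
```

For a 4×4 matrix with two dense 2×2 nonzero blocks on the diagonal, this hypothetical format stores 4 index integers where per-element COO would store 16. Because `backward` iterates over the same `coords` as `forward`, no new index structure has to be built for the gradient, which loosely mirrors the idea of carrying nonzero addresses from forward to backward propagation.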

STCO: Enhancing Training Efficiency Via Structured Sparse Tensor Compilation Optimization