Abstract: N:M sparsity pruning is a powerful technique for compressing deep neural networks that exploits NVIDIA's Sparse Tensor Core technology. Hardware support for sparse indexing lets it adopt fine-grained sparsity, which preserves model accuracy, while minimizing the overhead typically associated with irregular data access. Although its reliance on hardware restricts N:M sparsity to a fixed sparsity level, it can be combined with coarser sparsity techniques to achieve diverse compression ratios. First, column-wise vector sparsity is applied to a dense model; then row-wise N:M sparsity is applied to the preserved column vectors. We call this multi-level approach hierarchical N:M (HiNM) sparsity. Like earlier single-level sparsity techniques, HiNM sparsity needs an effective channel permutation strategy to maximize the accuracy of the compressed networks. However, it is more complex because both input and output channels must be rearranged, which raises challenges such as the permutation sequence, HiNM-sparsity-aware permutation, and keeping channel ordering consistent across layers. In this paper, we introduce gyro-permutation, a channel permutation method designed specifically for HiNM sparsity. It exploits the unique characteristics of HiNM pruning by incorporating a strategic policy in each permutation phase (channel sampling, clustering, and assignment) to avoid local minima. Additionally, we have developed a GPU kernel that facilitates independent layer permutation during the execution of HiNM sparse networks. Our extensive experimental evaluations on various DNN models demonstrate that gyro-permutation significantly enhances the accuracy of HiNM sparse networks, allowing them to reach performance levels comparable to those of unstructured sparse networks.
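To make the two-level pruning order concrete, the following is a minimal NumPy sketch of how a HiNM mask might be built. It is not the authors' implementation. The function name `hinm_mask`, the tiling of rows into groups of `vec_len`, the L1-norm score for column vectors, and magnitude-based N:M selection are all assumptions chosen for illustration; the abstract specifies only the order of the two steps (column-wise vector sparsity first, then row-wise N:M on the preserved vectors).

```python
import numpy as np


def hinm_mask(weight, vec_len=4, keep_cols=8, n=2, m=4):
    """Build a boolean keep-mask: column-vector pruning, then row-wise N:M.

    Assumed layout (hypothetical): rows are output channels, columns are
    input channels, and rows are processed in tiles of `vec_len`.
    """
    rows, cols = weight.shape
    assert rows % vec_len == 0, "rows must split evenly into tiles"
    assert keep_cols % m == 0 and keep_cols <= cols
    mask = np.zeros(weight.shape, dtype=bool)

    for r0 in range(0, rows, vec_len):
        tile = np.abs(weight[r0:r0 + vec_len])            # (vec_len, cols)

        # Level 1: score each column vector (assumed: L1 norm) and keep
        # the `keep_cols` strongest, preserving their original order.
        scores = tile.sum(axis=0)
        kept = np.sort(np.argsort(scores)[-keep_cols:])
        sub = tile[:, kept]                               # (vec_len, keep_cols)

        # Level 2: within each row, keep the n largest-magnitude entries
        # out of every m consecutive preserved entries.
        groups = sub.reshape(vec_len, keep_cols // m, m)
        order = np.argsort(groups, axis=-1)
        sub_mask = np.zeros(groups.shape, dtype=bool)
        np.put_along_axis(sub_mask, order[..., -n:], True, axis=-1)

        mask[r0:r0 + vec_len, kept] = sub_mask.reshape(vec_len, keep_cols)
    return mask


# Usage: an 8x16 layer, keeping 8 of 16 column vectors per tile, then 2:4.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
mask = hinm_mask(w)
density = mask.mean()  # 0.5 (vector level) * 0.5 (2:4) = 0.25
```

In this sketch the overall density is the product of the two levels, which is how a fixed 2:4 pattern can be combined with a coarser level to reach other compression ratios. Because both levels select by position, the weight magnitude that survives depends on how channels are ordered, which is the reason a permutation step matters.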

Determining optimal channel partition for 2:4 fine grained structured sparsity

SUBP: Soft Uniform Block Pruning for 1×N Sparse CNNs Multithreading Acceleration

Channel Permutations for N:M Sparsity

A Pruning Method Based on the Dissimilarity of Angle among Channels and Filters

An Automatically Layer-wise Searching Strategy for Channel Pruning Based on Task-driven Sparsity Optimization

Multiple Sparse Spaces Network Pruning Via a Joint Similarity Criterion

Multi-Dimensional Dynamic Pruning: Exploring Spatial and Channel Fuzzy Sparsity

Exploiting Weight-Level Sparsity in Channel Pruning with Low-Rank Approximation

Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs

Three-Stage Global Channel Pruning for Resources-Limited Platform

Revisiting Random Channel Pruning for Neural Network Compression

Probability-Based Channel Pruning for Depthwise Separable Convolutional Networks

A Clustering Pruning Method Based on Multidimensional Channel Information

1×N Block Pattern for Network Sparsity

Exploiting Channel Similarity for Network Pruning

An Effective Information Theoretic Framework for Channel Pruning

Exploiting Channel Similarity for Accelerating Deep Convolutional Neural Networks

Exploring the Regularity of Sparse Structure in Convolutional Neural Networks

Channel Pruning via Automatic Structure Search

Channel Pruning Method Based on Decoupling Feature Scale Distribution in Batch Normalization Layers