Abstract: Recently, differentiable mask pruning methods optimize the continuous relaxation architecture (soft network) as the proxy of the pruned discrete network (hard network) for superior sub-architecture search. However, due to the agnostic impact of the discretization process, the hard network struggles to match the representational capacity of the soft network, a mismatch termed the discretization gap, which severely degrades pruning performance. In this paper, we first investigate the discretization gap and propose a novel structural differentiable mask pruning framework named S2HPruner to bridge the discretization gap in a one-stage manner. In the training procedure, S2HPruner forwards both the soft network and its corresponding hard network, then distills the hard network under the supervision of the soft network. To optimize the mask and prevent performance degradation, we propose a decoupled bidirectional knowledge distillation. It blocks the weight updating from the hard to the soft network while maintaining the gradient corresponding to the mask. Compared with existing pruning methods, S2HPruner achieves superior pruning performance without fine-tuning on comprehensive benchmarks, including CIFAR-100, Tiny ImageNet, and ImageNet, with a variety of network architectures. In addition, investigation and analysis experiments explain the effectiveness of S2HPruner. Code will be released soon.
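To make the discretization gap concrete, here is a minimal sketch in plain Python. It is not the paper's code: the one-neuron "network", the sigmoid relaxation, the top-k discretization rule, and the names `soft_mask`, `hard_mask`, and `masked_output` are all assumptions chosen for illustration. It only shows that the same weights can produce different outputs under a soft mask and under its discretized counterpart.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_mask(logits):
    """Continuous relaxation: every channel gets a weight in (0, 1)."""
    return [sigmoid(z) for z in logits]


def hard_mask(logits, keep):
    """Assumed discretization rule: keep the `keep` largest logits, drop the rest."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = set(order[:keep])
    return [1.0 if i in kept else 0.0 for i in range(len(logits))]


def masked_output(x, w, mask):
    """Toy one-neuron network: weighted sum over masked channels."""
    return sum(m * wi * xi for m, wi, xi in zip(mask, w, x))


# Illustrative values only; none of these numbers come from the paper.
x = [1.0, 2.0, -1.0, 0.5]
w = [0.5, -0.3, 0.8, 1.2]
logits = [2.0, 0.1, -0.2, -3.0]

y_soft = masked_output(x, w, soft_mask(logits))
y_hard = masked_output(x, w, hard_mask(logits, keep=2))
gap = abs(y_soft - y_hard)
print(f"soft={y_soft:.4f} hard={y_hard:.4f} gap={gap:.4f}")
```

In this toy setting the gap is nonzero because the soft mask still passes a fraction of the dropped channels and attenuates the kept ones. Optimizing only the soft output therefore gives no guarantee about the hard output.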

### What problem does this paper attempt to solve?

This paper addresses the **discretization gap in differentiable mask pruning (DMP)**. DMP methods optimize a continuous relaxation of the architecture (the soft network) as a performance proxy for the discretized, compact pruned network (the hard network). Because of the discretization step, however, the hard network often cannot match the representational capacity of the soft network, which causes a significant drop in performance. This mismatch is called the **discretization gap**.

To address it, the authors propose a new structured differentiable mask pruning framework called **S2HPruner**. The framework narrows the discretization gap through a **soft-to-hard distillation** mechanism: during training it forward-propagates both the soft network and its corresponding hard network, and distills the hard network under the supervision of the soft network. To prevent performance degradation, the authors also propose **decoupled bidirectional knowledge distillation**, which blocks weight updates from passing from the hard network to the soft network while keeping the gradients with respect to the mask.

### Main contributions

1. **Reveal and address the long-overlooked discretization gap**: the soft-to-hard distillation mechanism reduces the performance gap between the hard network and the soft network.
2. **Propose a new differentiable mask pruning framework, S2HPruner**: the framework optimizes the soft network and also improves the hard network through distillation.
3. **Propose decoupled bidirectional knowledge distillation**: gradients are routed selectively, so that weight updates from the hard network to the soft network are blocked while mask gradients are kept, which avoids performance degradation.
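The gradient routing described above can be sketched with hand-written gradients on a toy model. This is one possible reading of the summary, not the authors' implementation: the summary gives neither the distillation loss nor the exact routing rule, so the squared-error loss, the one-neuron network, and the hypothetical function `decoupled_kd_grads` are all assumptions.

```python
def decoupled_kd_grads(x, w, soft, hard):
    """Gradients of L = 0.5 * (y_hard - y_soft)**2 under an assumed decoupling rule.

    x, w : inputs and shared weights of a toy one-neuron network
    soft : soft mask values in (0, 1)
    hard : discretized mask values in {0, 1}
    """
    y_soft = sum(m * wi * xi for m, wi, xi in zip(soft, w, x))
    y_hard = sum(h * wi * xi for h, wi, xi in zip(hard, w, x))
    diff = y_hard - y_soft

    # Weights: only the hard-network branch contributes. The soft branch is
    # treated as a fixed teacher, so its weight gradient is blocked (assumption).
    grad_w = [diff * h * xi for h, xi in zip(hard, x)]

    # Mask: the gradient through the soft branch is kept, so the mask can
    # still move to shrink the gap (assumption).
    grad_mask = [-diff * wi * xi for wi, xi in zip(w, x)]
    return diff, grad_w, grad_mask


# Illustrative values only; none of these numbers come from the paper.
x = [1.0, 2.0, -1.0, 0.5]
w = [0.5, -0.3, 0.8, 1.2]
soft = [0.9, 0.5, 0.4, 0.1]
hard = [1.0, 1.0, 0.0, 0.0]

diff, grad_w, grad_mask = decoupled_kd_grads(x, w, soft, hard)
print(diff, grad_w, grad_mask)
```

In an autograd framework the same routing would typically be expressed by detaching the weights on the soft branch while leaving the mask attached. The manual version here just makes explicit which terms are dropped.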
### Experimental results

Experiments show that S2HPruner performs well on several mainstream datasets (CIFAR-100, Tiny ImageNet, and ImageNet) and a variety of network architectures, and achieves superior pruning performance without additional fine-tuning. For example, on ImageNet, S2HPruner retains 96.17% of the baseline Top-1 accuracy at about 15% of the FLOPs (73.23% vs. 76.15%), significantly outperforming other methods.

### Summary

By introducing the soft-to-hard distillation mechanism, this paper addresses the discretization gap in differentiable mask pruning and significantly improves the performance of the pruned network.
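As a quick check on the ImageNet figures quoted above, the 96.17% value is consistent with reading it as the ratio of the two Top-1 accuracies. This assumes 76.15% is the unpruned baseline and 73.23% is the pruned model, which the summary implies but does not state outright.

```python
pruned_top1 = 73.23    # Top-1 accuracy (%) of the pruned model, from the summary
baseline_top1 = 76.15  # assumed to be the unpruned baseline Top-1 accuracy (%)

retained = 100 * pruned_top1 / baseline_top1
print(f"{retained:.2f}%")  # prints "96.17%"
```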

S2HPruner: Soft-to-Hard Distillation Bridges the Discretization Gap in Pruning

A Feature-map Discriminant Perspective for Pruning Deep Neural Networks

Class-Aware Pruning for Efficient Neural Networks

Structured Probabilistic Pruning for Convolutional Neural Network Acceleration.

Loss Constrains Added Squeeze and Excitation Blocks for Pruning Deep Neural Networks

Pruning by Training: A Novel Deep Neural Network Compression Framework for Image Processing.

Separate, Dynamic and Differentiable (SMART) Pruner for Block/Output Channel Pruning on Computer Vision Tasks

DHP: Differentiable Meta Pruning via HyperNetworks

Dynamic Probabilistic Pruning: A General Framework for Hardware-Constrained Pruning at Different Granularities

A Dynamic Pruning Method on Multiple Sparse Structures in Deep Neural Networks

Multi-Dimensional Dynamic Pruning: Exploring Spatial and Channel Fuzzy Sparsity

Knapsack Pruning with Inner Distillation

PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning

EPSD: Early Pruning with Self-Distillation for Efficient Model Compression

Multi-Dimensional Pruning: Joint Channel, Layer and Block Pruning with Latency Constraint

Towards Fairness-aware Adversarial Network Pruning

Structured Pruning Learns Compact and Accurate Models

Non-Parametric Adaptive Network Pruning

Enhanced Sparsification via Stimulative Training

Efficient Training Acceleration via Sample-Wise Dynamic Probabilistic Pruning

Manifold Regularized Dynamic Network Pruning.