S2HPruner: Soft-to-Hard Distillation Bridges the Discretization Gap in Pruning

Weihao Lin, Shengji Tang, Chong Yu, Peng Ye, Tao Chen
2024-10-10
Abstract: Recently, differentiable mask pruning methods optimize the continuous relaxation architecture (soft network) as the proxy of the pruned discrete network (hard network) for superior sub-architecture search. However, due to the agnostic impact of the discretization process, the hard network struggles to match the representational capacity of the soft network, a mismatch termed the discretization gap, which severely spoils the pruning performance. In this paper, we first investigate the discretization gap and propose a novel structural differentiable mask pruning framework named S2HPruner to bridge the discretization gap in a one-stage manner. In the training procedure, S2HPruner forwards both the soft network and its corresponding hard network, then distills the hard network under the supervision of the soft network. To optimize the mask and prevent performance degradation, we propose a decoupled bidirectional knowledge distillation. It blocks the weight updating from the hard to the soft network while maintaining the gradient corresponding to the mask. Compared with existing pruning arts, S2HPruner achieves surpassing pruning performance without fine-tuning on comprehensive benchmarks, including CIFAR-100, Tiny ImageNet, and ImageNet with a variety of network architectures. Besides, investigation and analysis experiments explain the effectiveness of S2HPruner. Codes will be released soon.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the **discretization gap** in Differentiable Mask Pruning (DMP). DMP methods optimize a continuous relaxation of the architecture (the soft network) as a performance proxy for the discretized, compact pruned network (the hard network). Because the effect of discretization is hard to predict, the hard network often fails to match the representational capacity of the soft network, which causes a significant performance drop. This mismatch is called the **discretization gap**.

To bridge it, the authors propose a structured differentiable mask pruning framework called **S2HPruner**. The framework introduces a **soft-to-hard distillation** mechanism: during training it forwards both the soft network and its corresponding hard network, then distills the hard network under the supervision of the soft network. To optimize the mask without degrading performance, the authors also propose **decoupled bidirectional knowledge distillation**, which blocks weight updates from the hard network to the soft network while keeping the gradients with respect to the mask.

### Main contributions

1. **Reveal and address the long-overlooked discretization gap**: The soft-to-hard distillation mechanism reduces the performance gap between the hard network and the soft network.
2. **Propose a new differentiable mask pruning framework, S2HPruner**: The framework optimizes the soft network and, through distillation, also improves the performance of the hard network.
3. **Propose decoupled bidirectional knowledge distillation**: The method selectively controls which gradients reach the model weights and which reach the mask parameters, avoiding performance degradation.
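The relationship between the soft and hard networks can be made concrete with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a single linear layer whose input channels are scaled by a mask, a hard mask obtained by thresholding the soft mask at 0.5, and a KL divergence between the two output distributions as the distillation signal. The function names, the threshold, and all numeric values are hypothetical, and the decoupled gradient routing is not modeled here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, weights, mask):
    """Toy one-layer network: each input channel is scaled by its mask entry.

    weights[k][c] connects input channel c to output class k.
    """
    return [
        sum(w * m * xc for w, m, xc in zip(row, mask, x))
        for row in weights
    ]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical input, weights, and soft mask (illustrative values only).
x = [1.0, 2.0, -1.0, 0.5]
weights = [[0.5, -0.2, 0.1, 0.3],
           [-0.4, 0.6, 0.2, -0.1]]
soft_mask = [0.9, 0.6, 0.4, 0.1]

# Assumed discretization rule: keep a channel if its soft mask exceeds 0.5.
hard_mask = [1.0 if m > 0.5 else 0.0 for m in soft_mask]

soft_probs = softmax(forward(x, weights, soft_mask))
hard_probs = softmax(forward(x, weights, hard_mask))

# The soft network acts as the teacher, the hard network as the student.
gap = kl_divergence(soft_probs, hard_probs)
print("hard mask:", hard_mask)
print("soft-to-hard KL:", gap)
```

A nonzero KL value here is a small-scale picture of the discretization gap: the same weights produce different predictions once the mask is binarized. In this toy setting, a soft-to-hard distillation loss would push that value toward zero during training.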
### Experimental results

Experiments show that S2HPruner achieves superior pruning performance without additional fine-tuning on several mainstream datasets (CIFAR-100, Tiny ImageNet, and ImageNet) and a variety of network architectures. For example, on ImageNet at about 15% FLOPs, S2HPruner retains 96.17% of the baseline Top-1 accuracy (73.23% vs. 76.15%), significantly outperforming other methods.

### Summary

This paper bridges the discretization gap in differentiable mask pruning by introducing a soft-to-hard distillation mechanism, which significantly improves the performance of the pruned network.
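The 96.17% figure in the experimental results reads most naturally as the ratio of the two Top-1 accuracies quoted in parentheses, rather than as an absolute accuracy. A quick check of that reading, assuming 73.23% is the pruned model and 76.15% is the unpruned baseline (variable names are illustrative):

```python
# Assumed roles of the two quoted numbers: pruned model vs. unpruned baseline.
pruned_top1 = 73.23
baseline_top1 = 76.15

# Share of the baseline accuracy that the pruned model keeps, in percent.
retained = pruned_top1 / baseline_top1 * 100
print(f"retained accuracy: {retained:.2f}%")
```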