Enhanced Sparsification via Stimulative Training

Shengji Tang, Weihao Lin, Hancheng Ye, Peng Ye, Chong Yu, Baopu Li, Tao Chen
DOI: https://doi.org/10.48550/arXiv.2403.06417
2024-03-11
Abstract: Sparsification-based pruning has been an important category in model compression. Existing methods commonly set sparsity-inducing penalty terms to suppress the importance of dropped weights, which is regarded as the suppressed sparsification paradigm. However, this paradigm inactivates the dropped parts of networks, causing capacity damage before pruning and thereby leading to performance degradation. To alleviate this issue, we first study and reveal the relative sparsity effect in emerging stimulative training and then propose a structured pruning framework, named STP, based on an enhanced sparsification paradigm which maintains the magnitude of dropped weights and enhances the expressivity of kept weights by self-distillation. Besides, to find an optimal architecture for the pruned network, we propose a multi-dimension architecture space and a knowledge distillation-guided exploration strategy. To reduce the huge capacity gap of distillation, we propose a subnet mutating expansion technique. Extensive experiments on various benchmarks indicate the effectiveness of STP. Specifically, without fine-tuning, our method consistently achieves superior performance at different budgets, especially under extremely aggressive pruning scenarios, e.g., retaining 95.11% of the Top-1 accuracy (72.43% versus 76.15%) while reducing FLOPs by 85% for ResNet-50 on ImageNet. Codes will be released soon.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to maintain, or even improve, model performance while removing parameters during pruning. Existing pruning methods usually suppress the importance of the parts to be pruned by introducing a sparsity penalty term; the paper calls this the suppressed sparsification paradigm. However, this paradigm reduces network capacity before pruning, which degrades model performance. To alleviate this problem, the paper proposes an enhanced sparsification paradigm and realizes it as a structured pruning framework (STP) built on stimulative training (ST). Through self-distillation, STP maintains the weight magnitude of the parts to be pruned and enhances the expressive ability of the retained parts, achieving high-performance model compression under different resource constraints without fine-tuning.

### Main Contributions

1. **Reveal the Relative Sparsity Effect**: For the first time, the paper reveals the relative sparsity effect in stimulative training. This effect preserves as much of the unpruned network's capacity as possible before pruning, thereby reducing the performance degradation after pruning.
2. **Propose the STP Framework**: Based on the enhanced sparsification paradigm, the paper proposes a new pruning framework, stimulative training-guided pruning (STP). STP includes three key designs:
   - **Knowledge Distillation-Guided Exploration**: A heuristic search strategy driven by the knowledge distillation loss (KD loss) gradually explores the optimal sub-network architecture.
   - **Multi-Dimensional Sampling**: The sampling dimension of stimulative training is expanded to cover not only the depth (number of layers) but also the width (number of output channels per layer), to better balance sparsity and performance.
   - **Sub-Network Mutation Expansion**: Larger sub-networks are generated to enrich the sub-network capacity, reduce the capacity gap in distillation, and improve the performance of the main network.
3. **Extensive Experimental Verification**: Extensive experiments were carried out on multiple mainstream models (ResNet-50, WRN28-10, MobileNetV3, ViT, Swin Transformer) and datasets (CIFAR-100, TinyImageNet, ImageNet, COCO). The results show that STP can obtain high-performance compact networks with extremely low FLOPs without fine-tuning. For example, for ResNet-50 on ImageNet, STP retains 95.11% of the baseline Top-1 accuracy (72.43% versus 76.15%, since 72.43 / 76.15 ≈ 0.9511) while reducing FLOPs by 85%.

### Related Work

1. **Sparsification-Based Pruning**: Existing methods can be divided into static sparsification and dynamic sparsification. Static sparsification applies a globally invariant penalty intensity to all parameters, while dynamic sparsification applies individual, time-varying sparsification intensities to different parameters.
2. **Knowledge Distillation in Pruning**: Knowledge distillation is usually used after pruning to compensate for performance degradation. This paper instead makes knowledge distillation part of the pruning method itself: it uses self-distillation for sparsification and obtains the pruning mask through the KD loss.
3. **Stimulative Training**: Stimulative training improves the performance of the main network by transferring the knowledge of the main network to each deep sub-network. This paper expands the sampling dimension of stimulative training and uses its relative sparsity effect to achieve enhanced sparsification.

### Method

1. **Problem Definition**: The goal of pruning is to eliminate parameters from a given network to obtain a high-performance compact sub-network under specific resource constraints.
2. **Relative Sparsity Effect**: By fixing a specific sub-network for stimulative training and observing the distribution of the convolutional layer weights, the paper finds that stimulative training significantly enhances the selected weights. The result is a relatively concentrated weight distribution in which the unselected weights are comparatively small even though their magnitude is maintained.
3. **Framework**: The STP framework consists of the following steps: initializing the architecture pool, randomly initializing parameters, sampling sub-network architectures, forward-propagating the main network, applying sub-network architecture masks, supervising sub-networks, mutating and expanding sub-networks, and updating the architecture pool. Through these steps, STP gradually explores the optimal sub-network architecture and uses stimulative training to enhance the sub-network.

### Experiment

1. **Image Classification Tasks**: Experiments were carried out on the CIFAR-100, TinyImageNet, and ImageNet datasets.
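
The framework steps listed under Method (forward-propagating the main network, applying a sub-network mask, supervising the sub-network with a KD loss, mutating and expanding the sub-network, and updating the architecture pool) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy two-layer network, the width-only mask, the use of KL divergence as the KD loss, and the names `forward`, `kd_loss`, `mutate_expand`, and `update_pool` are all assumptions made for illustration, and no training step is shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened outputs (assumed KD loss)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward(x, w1, w2, width=None):
    """Toy two-layer ReLU network; `width` keeps only the first hidden units.

    The mask acts on the forward pass only, so the weights of dropped units
    are left untouched instead of being pushed toward zero.
    """
    kept = len(w1) if width is None else width
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)))
              for row in w1[:kept]]
    # zip stops at the shorter sequence, so dropped hidden units contribute 0.
    return [sum(wj * hj for wj, hj in zip(row, hidden)) for row in w2]


def mutate_expand(width, full_width, step=1):
    """Hypothetical mutation: grow the sampled width, capped at full width."""
    return min(full_width, width + step)


def update_pool(pool, candidates, pool_size):
    """Keep the `pool_size` (loss, width) entries with the lowest KD loss."""
    merged = {width: loss for loss, width in pool + candidates}
    ranked = sorted((loss, width) for width, loss in merged.items())
    return ranked[:pool_size]


# Toy weights: 3 hidden units, 2 inputs, 2 output classes.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [[1.0, 0.5, 0.2], [0.3, 1.0, 0.4]]
x = [1.0, 2.0]

teacher = forward(x, w1, w2)            # main network output
pool = []                               # architecture pool of (loss, width)
for width in (1, 2):                    # sampled sub-network widths
    grown = mutate_expand(width, full_width=len(w1))
    loss = kd_loss(teacher, forward(x, w1, w2, width=grown))
    pool = update_pool(pool, [(loss, grown)], pool_size=2)
```

In this sketch the pool ranks candidate widths by how closely the masked sub-network matches the main network's output. A real search would also have to enforce the FLOPs budget, which the toy example omits.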