FGGP: Fixed-Rate Gradient-First Gradual Pruning

Lingkai Zhu, Can Deniz Bezek, Orcun Goksel
2024-11-08
Abstract: In recent years, the increasing size of deep learning models and their growing demand for computational resources have drawn significant attention to the practice of pruning neural networks, while aiming to preserve their accuracy. In unstructured gradual pruning, which sparsifies a network by gradually removing individual network parameters until a targeted network sparsity is reached, recent works show that both gradient and weight magnitudes should be considered. In this work, we show that such a mechanism, e.g., the order of prioritization and selection criteria, is essential. We introduce a gradient-first magnitude-next strategy for choosing the parameters to prune, and show that a fixed-rate subselection criterion between these steps works better, in contrast to the annealing approach in the literature. We validate this on the CIFAR-10 dataset, with multiple randomized initializations on both VGG-19 and ResNet-50 network backbones, for pruning targets of 90, 95, and 98% sparsity and for both initially dense and 50% sparse networks. Our proposed fixed-rate gradient-first gradual pruning (FGGP) approach outperforms its state-of-the-art alternatives in most of the above experimental settings, even occasionally surpassing the upper bound of corresponding dense network results, and having the highest ranking across the considered experimental settings.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to reduce the number of parameters in deep learning models through pruning while preserving model accuracy. Specifically, it studies how gradual pruning can remove redundant parameters from a neural network without sacrificing performance, thereby lowering the model's computational resource requirements and improving its runtime efficiency, particularly for applications on edge computing and mobile devices.

### Main Contributions of the Paper

1. **Definition and Review of the Multi-step Top-K Selection Process**: The paper gives a clear and transparent definition and review of the multi-step Top-K selection process, a crucial step in gradual pruning.
2. **Importance of Priority and Selection Criteria**: The paper shows that, in a successful gradual pruning algorithm, both the order in which criteria are prioritized and the criteria used to select parameters are crucial, and that these two factors are interrelated.
3. **Gradient-First Top-K Selection Criterion**: The paper proposes a gradient-first Top-K selection criterion and finds that a fixed-rate selection quota works better than the annealing approach in the literature.
4. **State-of-the-Art Results on the CIFAR-10 Dataset**: The paper reports results that outperform state-of-the-art alternatives in most of the tested settings on CIFAR-10, for target sparsities of 90%, 95%, and 98%, across multiple random initializations and two network architectures, VGG-19 and ResNet-50.

### Method Overview

#### Pruning Scheduling

The paper uses a cubic sparsity schedule to gradually reduce the number of network parameters.
The sparsity \( s \) relates the number of network parameters \( N \) remaining after pruning to the parameter count \( N^* \) of the dense network:

\[ N = (1 - s) N^* \]

During training, the sparsity rises from the initial sparsity \( s_{\text{ini}} \) to the target sparsity \( s_{\text{fin}} \). The sparsity \( s_t \) at the \( t \)-th iteration is:

\[ s_t = s_{\text{fin}} + (s_{\text{ini}} - s_{\text{fin}})\left(1 - \frac{t - t_{\text{ini}}}{t_{\text{fin}} - t_{\text{ini}}}\right)^3 \]

At \( t = t_{\text{ini}} \) this gives \( s_t = s_{\text{ini}} \), and at \( t = t_{\text{fin}} \) it gives \( s_t = s_{\text{fin}} \).

#### Pruning Strategy

The paper proposes a two-step selection process to determine which parameters to prune:

1. **Sort by Gradient Magnitude**: First, sort the parameters by gradient magnitude \( |g_i| \) and select the \( r \cdot N_{t-\Delta t} \) parameters with the smallest gradients, where \( r \) is the fixed subselection rate.
2. **Sort by Weight Magnitude**: Then, sort these selected parameters by weight magnitude \( |\theta_i| \) and prune the \( N_p \) parameters with the smallest weights.

This avoids applying the weight-magnitude criterion directly to parameters that have not yet converged, thereby improving the pruning outcome.

### Experimental Results

The paper reports extensive experiments on the CIFAR-10 dataset using two network architectures, VGG-19 and ResNet-50. The proposed method (FGGP) outperforms existing state-of-the-art methods in most experimental settings, and occasionally surpasses the results of the corresponding dense networks, which are normally regarded as an upper bound.

### Conclusion

By introducing gradient-first gradual pruning, the paper significantly reduces the number of model parameters while maintaining model performance. This lowers the demand for computational resources and makes it more feasible to deploy complex models on edge computing and mobile devices.
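
As an illustration of the cubic schedule and the two-step selection described in the Method Overview, the following is a minimal sketch in plain Python rather than the authors' implementation. The function names `cubic_sparsity` and `select_to_prune`, the flat-list representation of weights and gradients, and the toy values are all assumptions made for this example. Two further assumptions are labeled in the comments: the schedule is clamped outside \( [t_{\text{ini}}, t_{\text{fin}}] \), and the candidate pool is never allowed to be smaller than the number of parameters to prune.

```python
from typing import List, Sequence


def cubic_sparsity(t: int, t_ini: int, t_fin: int,
                   s_ini: float, s_fin: float) -> float:
    """Sparsity s_t at iteration t under the cubic schedule."""
    # Assumption: hold s_ini before the schedule starts and s_fin after it ends.
    if t <= t_ini:
        return s_ini
    if t >= t_fin:
        return s_fin
    progress = (t - t_ini) / (t_fin - t_ini)
    return s_fin + (s_ini - s_fin) * (1.0 - progress) ** 3


def select_to_prune(weights: Sequence[float], grads: Sequence[float],
                    active: Sequence[int], rate: float,
                    n_prune: int) -> List[int]:
    """Gradient-first, magnitude-next selection among active parameters.

    `active` holds the indices of parameters that are not yet pruned,
    `rate` plays the role of r, and `n_prune` plays the role of N_p.
    """
    # Step 1: keep the rate * len(active) candidates with the smallest |g_i|.
    # Assumption: the pool is widened if it would hold fewer than n_prune items.
    n_candidates = min(len(active), max(n_prune, int(rate * len(active))))
    candidates = sorted(active, key=lambda i: abs(grads[i]))[:n_candidates]
    # Step 2: among the candidates, pick the n_prune with the smallest |theta_i|.
    by_weight = sorted(candidates, key=lambda i: abs(weights[i]))
    return sorted(by_weight[:n_prune])


# Toy example with 8 parameters (hypothetical values).
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.3, 0.02, -0.6]
grads = [0.01, 0.5, 0.02, 0.9, 0.03, 0.04, 0.05, 0.8]
pruned = select_to_prune(weights, grads, active=list(range(8)),
                         rate=0.5, n_prune=2)
print(pruned)  # [2, 5]
```

In this toy example, pure magnitude pruning would remove indices 3 and 6, which have the smallest weights but also large gradients. The gradient-first step excludes them from the candidate pool, so indices 2 and 5 are pruned instead.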