Abstract: Convolutional neural networks (CNNs) are reported to be overparametrized. The search for an optimal (minimal) and sufficient architecture is an NP-hard problem, as the hyperparameter space of possible network configurations is vast. Here, we introduce a layer-by-layer data-driven pruning method based on a computationally scalable entropic relaxation of the pruning problem. The sparse subnetwork is found from the pre-trained (full) CNN using network entropy minimization as a sparsity constraint. This allows deploying a numerically scalable algorithm with a sublinear scaling cost. The method is validated on several benchmarks (architectures): (i) MNIST (LeNet) with sparsity 55%-84% and loss in accuracy 0.1%-0.5%, and (ii) CIFAR-10 (VGG-16, ResNet18) with sparsity 73%-89% and loss in accuracy 0.1%-0.5%.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper addresses the overparameterization of convolutional neural networks (CNNs). Specifically, the authors propose a data-driven, layer-by-layer sparsification method that optimizes the structure of a CNN through an entropic relaxation of the pruning problem. The method aims to find a sparse subnetwork of a pre-trained full CNN while maintaining high performance. The main contributions include:
1. **Algorithm adaptation**: Adapt the SPARTAn algorithm to the sparsification of convolutional layers, demonstrating that, for convolutional layers with arbitrary support, the algorithm retains its sublinear cost scaling and its ability to work with small data.
2. **Verification of effectiveness**: Verify the effectiveness of the method on the MNIST and CIFAR-10 datasets using several convolutional network architectures (LeNet, VGG-16, ResNet18). For example, on CIFAR-10, 89% of the weights can be removed from VGG-16 with a performance loss of less than 0.1%.
3. **Redundancy analysis**: Determine which layers contain the most redundancy and can therefore be pruned most aggressively.
4. **Weight importance**: Examine whether the value of network pruning lies in the discovered network architecture rather than in the specific weight values. The pruned architecture is retrained from scratch with randomly initialized weights to assess how useful it is to retain the weights of the pre-trained sparse model.
### Main methods
1. **Entropy regularization**: Achieve network sparsification by using entropy minimization as a sparsity constraint in the regression problem.
2. **Layer-by-layer sparsification**: Interpret each convolutional layer as a fully connected layer and sparsify it by solving a linear regression task with entropy regularization.
3. **Experimental verification**: Conduct experiments on multiple benchmark datasets and network architectures to verify the effectiveness and practicality of the method.
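
The reinterpretation of a convolutional layer as a fully connected (linear) layer can be illustrated with a minimal sketch. The helper names `unfold_patches` and `conv_as_linear` are hypothetical and not taken from the paper. The sketch assumes stride 1, no padding, and cross-correlation (the convention used by most deep learning frameworks). Rows of the unfolded matrix are ordered channel by channel, with the k×k offsets of each channel grouped together; columns correspond to output positions.

```python
def unfold_patches(image, k):
    """Unfold a D x H x W image (nested lists) into a (k*k*D) x T matrix.

    Row d*k*k + a*k + b holds, for every output position, the input value
    of channel d at kernel offset (a, b). T is the number of positions.
    Assumes stride 1 and no padding (an illustrative choice).
    """
    D, H, W = len(image), len(image[0]), len(image[0][0])
    positions = [(i, j) for i in range(H - k + 1) for j in range(W - k + 1)]
    X = []
    for d in range(D):
        for a in range(k):
            for b in range(k):
                X.append([image[d][i + a][j + b] for (i, j) in positions])
    return X


def conv_as_linear(image, kernels, bias, k):
    """Evaluate a convolution as a linear map acting on unfolded patches.

    kernels: M filters, each D x k x k; bias: M offsets.
    Returns an M x T matrix of outputs.
    """
    X = unfold_patches(image, k)
    T = len(X[0])
    out = []
    for m, kern in enumerate(kernels):
        # Flatten the filter in the same row order used by unfold_patches.
        flat = [kern[d][a][b]
                for d in range(len(kern)) for a in range(k) for b in range(k)]
        out.append([bias[m] + sum(f * X[r][t] for r, f in enumerate(flat))
                    for t in range(T)])
    return out


# One 3x3 input channel and one 2x2 filter that adds the two diagonal entries.
image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
kernels = [[[[1, 0], [0, 1]]]]
result = conv_as_linear(image, kernels, bias=[0], k=2)
print(result)  # [[6, 8, 12, 14]]
```

Once the layer is written this way, each filter is simply one row of a regression matrix, so pruning the layer becomes a sparse linear regression from the unfolded inputs to the layer outputs.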
### Formula representation
- **Discrete Shannon entropy**:
\[
H(w)=-\sum_{i = 1}^{d}w_{i}\log w_{i}
\]
where \(w=(w_{1},\ldots,w_{d})\in\mathbb{R}_{\geq0}^{d}\) and \(\sum_{i = 1}^{d}w_{i}=1\).
- **Sparse entropy regression loss function**:
\[
L_{\text{sparsify}}(w,\Lambda)=\epsilon_{w}\sum_{d = 1}^{D}w_{d}\log w_{d}+\epsilon_{l2}\sum_{m = 1}^{M}\sum_{d = 1}^{k^{2}D}\Lambda_{m,d}^{2}+\frac{1}{T}\sum_{t = 1}^{T}\sum_{m = 1}^{M}\left(Y_{m,t}-\Lambda_{m,0}-\sum_{d = 1}^{D}w_{d}\sum_{l = 1}^{k^{2}}\Lambda_{m,(d - 1)k^{2}+l}X_{(d - 1)k^{2}+l,t}\right)^{2}
\]
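
The two formulas above can be evaluated directly, as in the following minimal sketch. It is a plain transcription of the expressions, not the authors' implementation, and it only evaluates the loss without minimizing it. The reading of the symbols is an assumption drawn from the index ranges: D input channels, a k×k kernel, M output channels, T samples, and \(\Lambda_{m,0}\) as the bias of output m. The function names are hypothetical.

```python
import math


def shannon_entropy(w):
    """H(w) = -sum_i w_i log w_i, using the convention 0 * log 0 = 0."""
    if any(x < 0 for x in w):
        raise ValueError("weights must be non-negative")
    if abs(sum(w) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return -sum(x * math.log(x) for x in w if x > 0)


def sparsify_loss(w, Lam, X, Y, k, eps_w, eps_l2):
    """Evaluate L_sparsify(w, Lambda) term by term.

    w   : D channel weights on the probability simplex
    Lam : M rows of length 1 + k*k*D; Lam[m][0] is the bias (assumed reading)
    X   : k*k*D rows of length T (unfolded inputs)
    Y   : M rows of length T (layer outputs to reproduce)
    """
    D, M, T = len(w), len(Y), len(Y[0])
    k2 = k * k
    # First term: eps_w * sum_d w_d log w_d, i.e. -eps_w * H(w).
    entropy_term = -shannon_entropy(w)
    # Second term: squared entries of Lambda, bias column excluded.
    l2_term = sum(Lam[m][j] ** 2 for m in range(M) for j in range(1, k2 * D + 1))
    # Third term: mean over samples of the squared regression residuals.
    fit = 0.0
    for t in range(T):
        for m in range(M):
            pred = Lam[m][0]
            for d in range(D):
                for l in range(k2):
                    j = d * k2 + l  # zero-based version of (d - 1) k^2 + l
                    pred += w[d] * Lam[m][j + 1] * X[j][t]
            fit += (Y[m][t] - pred) ** 2
    return eps_w * entropy_term + eps_l2 * l2_term + fit / T


# Entropy is largest for uniform weights and zero for a one-hot vector.
h_uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # equals log(4)
h_one_hot = shannon_entropy([1.0, 0.0, 0.0, 0.0])      # equals 0

# Toy case: D = 2 channels, k = 1, M = 1 output, T = 2 samples.
# Channel 2 has weight 0, so its inputs do not affect the prediction.
loss = sparsify_loss(
    w=[1.0, 0.0],
    Lam=[[0.5, 2.0, 3.0]],
    X=[[1.0, 2.0], [5.0, 7.0]],
    Y=[[2.5, 4.5]],
    k=1, eps_w=0.01, eps_l2=0.1,
)
```

In the toy case the fit is exact and the entropy term vanishes, so only the L2 penalty remains. A channel whose weight \(w_d\) is zero drops out of the prediction entirely, which is how low-entropy weight vectors translate into pruned inputs.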
### Experimental results
- **Performance of LeNet on MNIST**:
  - Sparsifying only the convolutional layers reduces the number of parameters by 40% and 60%, with performance drops of 0.3% and 1.69%, respectively.
  - Sparsifying the entire network reduces the number of parameters by 55% to 84%, with a performance drop between 0.1% and 0.5%.
- **Performance of VGG-16 on CIFAR-10**:
  - Selectively sparsifying different groups of convolutional layers reduces the number of parameters by 54.59% to 99.95%, with a performance drop between 0.1% and 3.21%.
### Conclusion
The proposed method effectively reduces the number of parameters in convolutional neural networks while maintaining high performance, providing an optimization scheme for large-scale applications.