Abstract: Does the process of training a neural network to solve a task tend to use all of the available weights even when the task could be solved with fewer weights? To address this question we study the effects of pruning fully connected, convolutional and residual models while varying their widths. We find that the proportion of weights that can be pruned without degrading performance is largely invariant to model size. Increasing the width of a model has little effect on the density of the pruned model relative to the increase in absolute size of the pruned network. In particular, we find substantial prunability across a large range of model sizes, where our biggest model is 50 times as wide as our smallest model. We explore three hypotheses that could explain these findings.
What problem does this paper attempt to address?
The paper asks whether training a neural network to solve a specific task tends to use all of the available weights, even when the task could be solved with fewer. In other words, the authors study whether a network spreads its solution across every available parameter regardless of how many the task actually requires. They investigate this by pruning fully-connected, convolutional, and residual models while varying the widths of these models.
### Research Background and Motivation
1. **Computational Cost and Transparency**
- Although large-scale dense neural networks can achieve impressive performance, their deployment is accompanied by high computational costs and a lack of transparency regarding their behavior.
2. **Importance of Sparsity**
- To reduce computational costs and improve the interpretability of deep-learning systems, many researchers have made sparsity a goal.
3. **Analogy to Parkinson's Law**
- By analogy to Parkinson's law ("The time required to complete a bureaucratic task will expand to fill the available time"), the authors raise a similar question: as more parameters become available, does a trained network expand to use all of them?
### Research Methods
- **Pruning Strategy**
- The authors determined the size of the "core" model by gradually pruning the weights with the lowest magnitudes until the model accuracy decreased by 5%. The number of parameters in the core model is denoted as \( |\omega| \), the number of parameters in the original model is denoted as \( |\theta| \), and the effective density is defined as \( \frac{|\omega|}{|\theta|} \).
- **Experimental Setup**
- Vary the width of the model (from 0.1 times to 5 times the default width) and evaluate the changes in the effective density of the model at different widths.
- Conduct experiments using different optimizers (such as SGD, Adam, Adagrad) and initialization schemes (such as Glorot, He).
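
The pruning procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names (`magnitude_mask`, `effective_density`), the pruning step size, the toy linear classifier, and the reading of "5%" as an absolute drop in accuracy are all assumptions made for the example.

```python
import numpy as np

def magnitude_mask(weights, keep_fraction):
    """Boolean mask that keeps the `keep_fraction` largest-magnitude weights."""
    flat = np.abs(weights).ravel()
    n_keep = int(round(keep_fraction * flat.size))
    mask = np.zeros(flat.size, dtype=bool)
    if n_keep > 0:
        # argsort is ascending, so the last n_keep indices are the largest.
        mask[np.argsort(flat)[flat.size - n_keep:]] = True
    return mask.reshape(weights.shape)

def effective_density(weights, evaluate, max_drop=0.05, n_steps=20):
    """Prune in equal steps until accuracy falls more than `max_drop`
    below the unpruned baseline (assumed here to be an absolute drop).

    Returns (density, mask) for the sparsest mask that stayed in tolerance,
    where density = |omega| / |theta|.
    """
    baseline = evaluate(weights)
    best_mask = np.ones(weights.shape, dtype=bool)
    for k in range(1, n_steps):
        mask = magnitude_mask(weights, 1.0 - k / n_steps)
        if evaluate(weights * mask) < baseline - max_drop:
            break
        best_mask = mask
    return best_mask.sum() / weights.size, best_mask

# Toy stand-in for a trained model (hypothetical): a linear classifier whose
# labels are generated by its own weights, so baseline accuracy is 1.0.
rng = np.random.default_rng(0)
w = np.concatenate([5.0 * rng.normal(size=10), 0.01 * rng.normal(size=90)])
X = rng.normal(size=(500, 100))
y = (X @ w) > 0

def accuracy(weights):
    return float(np.mean(((X @ weights) > 0) == y))

density, mask = effective_density(w, accuracy)
print(f"core size = {mask.sum()}, effective density = {density:.2f}")
```

Repeating this measurement for the same architecture at several widths gives the density-versus-width comparison the experiments describe.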
### Main Findings
1. **Rejection of the Null Hypothesis**
- The null hypothesis states that for a given architecture, training mechanism, and task, the size of the core model \( |\omega| \) does not change with the model width. However, the experimental results show that as the model width increases, the size of the core model also increases significantly, which contradicts the null hypothesis.
2. **Similarity of Effective Densities**
- Although the core model grows with width, the effective density stays roughly the same across widths: the fraction of weights that can be pruned is largely invariant to model size, so wider models do not concentrate their solution in a proportionally smaller set of weights.
3. **Influence of Optimizers and Architectures**
- Different optimizers and architectures have different effects on the effective density of the model. For example, the convolutional model exhibits a higher effective density when using the Adam optimizer, while the fully-connected model exhibits a lower effective density when using SGD.
4. **Influence of Initialization Schemes**
- Glorot initialization usually results in a lower effective density, while He initialization results in a higher effective density. However, this difference is relatively small, indicating that the cause of the density tendency may not lie in the initialization.
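
The contrast between the null hypothesis and the observed behavior can be written out using the effective density \( \frac{|\omega|}{|\theta|} \). The constants \( c \) and \( d_0 \) are illustrative symbols introduced here, not quantities reported in the source. If the core size were fixed at some \( c \) regardless of width, the density would shrink as the model grows:
\[
|\omega| = c \;\Rightarrow\; \frac{|\omega|}{|\theta|} = \frac{c}{|\theta|} \to 0 \quad \text{as } |\theta| \to \infty
\]
The experiments instead find a roughly constant density \( d_0 \), which implies that the core size grows in proportion to the total parameter count:
\[
\frac{|\omega|}{|\theta|} \approx d_0 \;\Rightarrow\; |\omega| \approx d_0 \, |\theta|
\]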
### Conclusions
This study shows that neural networks tend to use more parameters as more become available during training, even when the task itself does not require that many. Several mechanisms may lie behind this phenomenon, including the initial weight distribution, differences in model functions, and task separation during the training process. Future research needs to explore the specific roles of these mechanisms.
### Formula Summary
- **Effective Density**
\[
\text{Effective Density} = \frac{|\omega|}{|\theta|}
\]
where \( |\omega| \) is the number of parameters in the core model and \( |\theta| \) is the number of parameters in the original model.
- **Hoyer Sparsity**
\[
\text{Hoyer Sparsity} = \frac{\sqrt{n} - \frac{\sum_{i = 1}^n |x_i|}{\sqrt{\sum_{i = 1}^n x_i^2}}}{\sqrt{n} - 1}
\]
where \( n \) is the length of the vector and \( x_