What problem does this paper attempt to address?

The paper asks: when a neural network is trained on a specific task, does it tend to use all available weights, even if the task can be solved with fewer? In other words, the authors study whether networks use every available parameter during training even when the task does not require that many. They explore this by pruning fully-connected, convolutional, and residual models while varying the widths of these models.

### Research Background and Motivation

1. **Computational Cost and Transparency**
   - Large-scale dense neural networks can achieve impressive performance, but their deployment comes with high computational costs and a lack of transparency about their behavior.
2. **Importance of Sparsity**
   - Many researchers pursue sparsity to reduce computational costs and improve the interpretability of deep learning systems.
3. **Analogy to Parkinson's Law**
   - By analogy to Parkinson's law, "The time required to complete a bureaucratic task will expand to fill the available time", the authors ask a similar question: does a neural network make full use of its capacity as the number of available parameters grows?

### Research Methods

- **Pruning Strategy**
  - The authors determine the size of the "core" model by gradually pruning the lowest-magnitude weights until the model accuracy decreases by 5%. The number of parameters in the core model is denoted \( |\omega| \), the number of parameters in the original model is denoted \( |\theta| \), and the effective density is defined as \( \frac{|\omega|}{|\theta|} \).
- **Experimental Setup**
  - Vary the width of the model (from 0.1 times to 5 times the default width) and evaluate how the effective density changes across widths.
  - Conduct experiments using different optimizers (such as SGD, Adam, Adagrad) and initialization schemes (such as Glorot, He).

### Main Findings

1. **Rejection of the Null Hypothesis**
   - The null hypothesis states that, for a given architecture, training regime, and task, the size of the core model \( |\omega| \) does not change with the model width. The experiments show instead that the core model grows significantly as the width increases, which contradicts the null hypothesis.
2. **Similarity of Effective Densities**
   - Even as the width increases, effective densities stay relatively consistent across widths, indicating that wider models use a similar fraction of their parameters rather than leaving the extra capacity unused.
3. **Influence of Optimizers and Architectures**
   - Optimizers and architectures affect the effective density differently. For example, the convolutional model exhibits a higher effective density when trained with Adam, while the fully-connected model exhibits a lower effective density when trained with SGD.
4. **Influence of Initialization Schemes**
   - Glorot initialization usually results in a lower effective density, while He initialization results in a higher one. The difference is relatively small, however, suggesting that initialization may not be the cause of the tendency toward density.

### Conclusions

This study shows that neural networks do tend to use more parameters during training, even if the task itself does not require that many. Several mechanisms may lie behind this phenomenon, including the initial weight distribution, differences in model functions, and task separation during training. Future research needs to explore the specific roles of these mechanisms.
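The pruning strategy described under Research Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `core_size`, `effective_density`, and `toy_accuracy` are hypothetical, the 5% threshold is assumed to be an absolute accuracy drop of 0.05, weights are pruned one at a time without retraining, and the toy accuracy function merely stands in for a real evaluation loop.

```python
from typing import Callable, List, Sequence


def core_size(
    weights: Sequence[float],
    accuracy_fn: Callable[[List[float]], float],
    max_drop: float = 0.05,  # assumption: absolute drop in accuracy
) -> int:
    """Count the weights kept just before accuracy falls by more than max_drop."""
    baseline = accuracy_fn(list(weights))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    kept = len(weights)
    for idx in order:
        saved = pruned[idx]
        pruned[idx] = 0.0  # prune the lowest-magnitude remaining weight
        if baseline - accuracy_fn(pruned) > max_drop:
            pruned[idx] = saved  # undo the prune that crossed the threshold
            break
        kept -= 1
    return kept


def effective_density(core: int, total: int) -> float:
    """Effective density |omega| / |theta|."""
    return core / total


# Toy data: three large weights and three near-zero ones.
weights = [0.9, -0.8, 0.01, -0.02, 0.7, 0.005]
total_mass = sum(abs(w) for w in weights)


def toy_accuracy(ws: List[float]) -> float:
    # Hypothetical stand-in: "accuracy" is the share of weight magnitude retained.
    return sum(abs(w) for w in ws) / total_mass


core = core_size(weights, toy_accuracy)
print(core, effective_density(core, len(weights)))  # prints: 3 0.5
```

In this toy case the three near-zero weights can be removed with almost no loss, while removing the next weight crosses the threshold, so the core holds 3 of 6 parameters.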
### Formula Summary

- **Effective Density**
  \[ \text{Effective Density} = \frac{|\omega|}{|\theta|} \]
  where \( |\omega| \) is the number of parameters in the core model and \( |\theta| \) is the number of parameters in the original model.
- **Hoyer Sparsity**
  \[ \text{Hoyer Sparsity} = \frac{\sqrt{n} - \frac{\sum_{i = 1}^n |x_i|}{\sqrt{\sum_{i = 1}^n x_i^2}}}{\sqrt{n} - 1} \]
  where \( n \) is the length of the vector and \( x_i \) is its \( i \)-th element.
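As a small worked example of the Hoyer sparsity formula above, the sketch below computes it for a plain list of numbers. The function name `hoyer_sparsity` and the error handling for degenerate inputs are assumptions made for illustration; the summary does not say how the authors treat those cases.

```python
import math
from typing import Sequence


def hoyer_sparsity(x: Sequence[float]) -> float:
    """Hoyer sparsity: (sqrt(n) - L1/L2) / (sqrt(n) - 1)."""
    n = len(x)
    if n < 2:
        # The denominator sqrt(n) - 1 is zero for n == 1.
        raise ValueError("need at least two elements")
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if l2 == 0.0:
        # The ratio L1/L2 is undefined for the all-zero vector.
        raise ValueError("undefined for the all-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


print(hoyer_sparsity([0.0, 0.0, 0.0, 1.0]))  # one-hot vector: maximally sparse
print(hoyer_sparsity([1.0, 1.0, 1.0, 1.0]))  # uniform vector: maximally dense
```

The measure ranges from 0 for a vector whose entries all have equal magnitude to 1 for a vector with a single non-zero entry.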

The Propensity for Density in Feed-forward Models

Class-Aware Pruning for Efficient Neural Networks

Small Contributions, Small Networks: Efficient Neural Network Pruning Based on Relative Importance

A Probabilistic Approach to Neural Network Pruning

Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity

Statistical Mechanical Analysis of Neural Network Pruning

Exploring Weight Importance and Hessian Bias in Model Pruning

Optimizing Dense Feed-Forward Neural Networks

To prune or not to prune : A chaos-causality approach to principled pruning of dense neural networks

Frivolous Units: Wider Networks Are Not Really That Wide

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

Effective Sparsification of Neural Networks with Global Sparsity Constraint

Neural Network Pruning as Spectrum Preserving Process

No Free Prune: Information-Theoretic Barriers to Pruning at Initialization

Fine Granularity Is Critical for Intelligent Neural Network Pruning

Detecting Dead Weights and Units in Neural Networks

Students and teachers learning together: a robust training strategy for neural network pruning

Rethinking the Value of Network Pruning

How Sparse Can We Prune A Deep Network: A Fundamental Limit Viewpoint

The Generalization-Stability Tradeoff In Neural Network Pruning

Is Complexity Required for Neural Network Pruning? A Case Study on Global Magnitude Pruning