Abstract: Static sparse training aims to train sparse models from scratch, achieving remarkable results in recent years. A key design choice is given by the sparse initialization, which determines the trainable sub-network through a binary mask. Existing methods mainly select such mask based on a predefined dense initialization. Such an approach may not efficiently leverage the mask's potential impact on the optimization. An alternative direction, inspired by research into dynamical isometry, is to introduce orthogonality in the sparse subnetwork, which helps in stabilizing the gradient signal. In this work, we propose Exact Orthogonal Initialization (EOI), a novel sparse orthogonal initialization scheme based on composing random Givens rotations. Contrary to other existing approaches, our method provides exact (not approximated) orthogonality and enables the creation of layers with arbitrary densities. We demonstrate the superior effectiveness and efficiency of EOI through experiments, consistently outperforming common sparse initialization techniques. Our method enables training highly sparse 1000-layer MLP and CNN networks without residual connections or normalization techniques, emphasizing the crucial role of weight initialization in static sparse training alongside sparse mask selection. The code is available at https://github.com/woocash2/sparser-better-deeper-stronger

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the effectiveness of sparse initialization in Static Sparse Training (SST). Specifically, existing static sparse training methods mainly rely on predefined dense initialization to select trainable sub-networks, which may not fully utilize the influence of masks on the optimization process. Moreover, although the introduction of orthogonality can help stabilize the gradient signal, existing methods usually can only provide approximate orthogonality and have limitations in terms of flexibility and applicability. To solve these problems, this paper proposes a new sparse orthogonal initialization scheme, Exact Orthogonal Initialization (EOI). EOI generates sparse orthogonal matrices by combining random Givens rotations, thus ensuring exact orthogonality and being able to achieve arbitrary sparsity while maintaining high efficiency. This method not only improves the performance and efficiency of static sparse training but also makes it possible to train very sparse deep neural networks (such as 1000-layer MLPs and CNNs) without the need for residual connections or normalization techniques.

### Specific Problem Description

1. **Limitations of Existing Methods**:
   - Existing static sparse training methods rely on predefined dense initialization to select trainable sub-networks, which may lead to the failure to fully utilize the potential of masks in the optimization process.
   - Existing methods usually can only provide approximate orthogonality and have limitations in terms of architecture compatibility and sparsity per layer.
2. **Necessity of Introducing Orthogonality**:
   - Orthogonality helps stabilize the gradient signal, thereby improving the training effect of the model, especially in deep networks.
   - Previous studies have shown that orthogonal initialization can significantly improve signal propagation, enabling the network to be trained to thousands of layers without the need for residual connections or normalization layers.
3. **Goals and Contributions**:
   - Propose a new sparse orthogonal initialization scheme (EOI) that can provide exact orthogonality and support arbitrary sparsity.
   - Verify the superior performance of EOI in static sparse training, especially when training highly sparse deep networks.
   - Verify the effectiveness of EOI under multiple activation functions and different sparsity levels through experiments.

### Solutions

The paper proposes the following solutions:

- **Exact Orthogonal Initialization (EOI)**: Generate sparse orthogonal matrices by combining random Givens rotations to ensure exact orthogonality and be able to achieve arbitrary sparsity while maintaining high efficiency.
- **Experimental Verification**: Demonstrate the superior performance of EOI in static sparse training through experiments, especially when training highly sparse deep networks.

### Experimental Results

The experimental results show that EOI is superior to existing sparse initialization methods in several aspects:

- **Signal Propagation Stability**: EOI can better keep the singular values of the input-output Jacobian matrix close to 1, thus stabilizing signal propagation.
- **Training Efficiency and Performance**: EOI makes it possible to train highly sparse 1000-layer MLPs and CNNs and performs excellently in terms of test accuracy.
- **Adaptability**: EOI is applicable to different activation functions and sparsity levels and has wide applicability.

In conclusion, this paper solves the problem of the effectiveness of sparse initialization in static sparse training by proposing EOI and demonstrates its superior performance in training highly sparse deep networks.
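The idea of building a sparse orthogonal matrix from random Givens rotations can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sparse_orthogonal`, its parameters, and the stopping rule (keep rotating until the fraction of nonzero entries reaches `target_density`) are assumptions made for illustration, and the sketch covers only a square weight matrix. Each Givens rotation mixes just two rows, so the product stays orthogonal up to floating-point rounding while the number of nonzeros grows gradually from the identity.

```python
import numpy as np


def sparse_orthogonal(n, target_density, seed=0, max_rotations=100_000):
    """Compose random Givens rotations until the n x n product is dense enough.

    Hypothetical helper for illustration; the stopping rule is an assumption.
    """
    if not 1.0 / n <= target_density <= 1.0:
        raise ValueError("target_density must lie in [1/n, 1]")
    rng = np.random.default_rng(seed)
    w = np.eye(n)  # the identity is orthogonal and has density 1/n
    rotations = 0
    while np.count_nonzero(w) / w.size < target_density:
        if rotations >= max_rotations:
            raise RuntimeError("target density not reached")
        # Pick two distinct rows and a random rotation angle.
        i, j = rng.choice(n, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        # Left-multiplying by a Givens rotation only changes rows i and j.
        row_i, row_j = w[i].copy(), w[j].copy()
        w[i] = c * row_i - s * row_j
        w[j] = s * row_i + c * row_j
        rotations += 1
    return w


if __name__ == "__main__":
    w = sparse_orthogonal(n=64, target_density=0.1)
    print("density:", np.count_nonzero(w) / w.size)
    print("orthogonal:", np.allclose(w.T @ w, np.eye(64)))
    print("singular values near 1:",
          np.allclose(np.linalg.svd(w, compute_uv=False), 1.0))
```

Because the last rotation can overshoot, the resulting density is at least `target_density` rather than exactly equal to it. All singular values of an orthogonal matrix equal 1, which is the per-layer property that the signal-propagation argument above relies on.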

Sparser, Better, Deeper, Stronger: Improving Sparse Training with Exact Orthogonal Initialization
