Always-Sparse Training by Growing Connections with Guided Stochastic Exploration

Mike Heddes,Narayan Srinivasa,Tony Givargis,Alexandru Nicolau
2024-01-13
Abstract:The excessive computational requirements of modern artificial neural networks (ANNs) are posing limitations on the machines that can run them. Sparsification of ANNs is often motivated by time, memory and energy savings only during model inference, yielding no benefits during training. A growing body of work is now focusing on providing the benefits of model sparsification also during training. While these methods greatly improve the training efficiency, the training algorithms yielding the most accurate models still materialize the dense weights, or compute dense gradients during training. We propose an efficient, always-sparse training algorithm with excellent scaling to larger and sparser models, supported by its linear time complexity with respect to the model width during training and inference. Moreover, our guided stochastic exploration algorithm improves over the accuracy of previous sparse training methods. We evaluate our method on CIFAR-10/100 and ImageNet using ResNet, VGG, and ViT models, and compare it against a range of sparsification methods.
Machine Learning
What problem does this paper attempt to address?
### The problems the paper attempts to solve This paper aims to solve the problem of excessive computational resource requirements in the training process of modern artificial neural networks (ANNs). Although sparsification techniques have been able to bring savings in time, memory, and energy during the model inference stage, these techniques usually cannot provide similar benefits during the training stage. Some existing sparse training methods have improved training efficiency, but still require computationally - intensive gradients or storage - intensive weights during the training process, which limits their application to large - scale sparse models. To this end, the authors propose an efficient, always - sparse training algorithm - **Guided Stochastic Exploration (GSE)**. This algorithm not only remains sparse during both the training and inference stages, but also has a linear - time complexity and is suitable for larger and sparser models. In addition, GSE improves the accuracy of sparse training methods through the guided stochastic exploration strategy. ### Main contributions 1. **Always - sparse training algorithm**: The GSE algorithm remains sparse throughout the training process, without computing computationally - intensive gradients or storing storage - intensive weights, thereby significantly reducing computational and memory requirements. 2. **Efficient subset sampling**: By randomly sampling subsets and selecting the connections with the largest gradients for growth, GSE can effectively explore network connections while ensuring sparsity. 3. **Linear - time complexity**: The time complexity of GSE is \(O(n + \text{NNZ})\), where \(n\) is the width of the model and \(\text{NNZ} \leq n^2\) is the number of non - zero weights. For the common Erdős–Rényi model initialization, the time complexity is simplified to \(O(n)\). 4. **Experimental verification**: The authors conducted experiments on the CIFAR - 10/100 and ImageNet datasets using ResNet, VGG, and ViT models and compared them with various sparsification methods. The results show that GSE can still achieve high accuracy at high sparsity levels. ### Experimental results - **CIFAR - 10 and CIFAR - 100**: GSE achieved accuracy comparable to or higher than existing methods at 90%, 95%, and 98% sparsity levels. - **ImageNet**: GSE also performs well on large - scale models, especially at high sparsity levels, where its accuracy exceeds that of other sparse training methods. - **FLOPs comparison**: The amount of computation of GSE is significantly lower than that of methods such as RigL, especially at high sparsity levels. ### Conclusion The GSE algorithm successfully solves the problem of computational resource requirements in the training process of large - scale sparse models through an always - sparse training strategy and an efficient subset sampling method while maintaining high accuracy. This method provides new possibilities for future training of larger - scale and sparser models.