Abstract: We develop an approach to efficiently grow neural networks, within which parameterization and optimization strategies are designed by considering their effects on the training dynamics. Unlike existing growing methods, which follow simple replication heuristics or utilize auxiliary gradient-based local optimization, we craft a parameterization scheme which dynamically stabilizes weight, activation, and gradient scaling as the architecture evolves, and maintains the inference functionality of the network. To address the optimization difficulty resulting from imbalanced training effort distributed to subnetworks fading in at different growth phases, we propose a learning rate adaptation mechanism that rebalances the gradient contribution of these separate subcomponents. Experimental results show that our method achieves comparable or better accuracy than training large fixed-size models, while saving a substantial portion of the original computation budget for training. We demonstrate that these gains translate into real wall-clock training speedups.

What problem does this paper attempt to address?

This paper addresses the high cost of training large neural networks and proposes an efficient method for growing a network during training. Its goals are:

1. **Improving Training Efficiency**:
   - Training starts from a small model and gradually increases the network width until the target size is reached, which saves a substantial share of the computation.
   - Compared with directly training a large fixed-size model, the method reaches comparable or better accuracy at a lower computational cost.
2. **Maintaining Functional Continuity**:
   - Newly added parameters must not change the function the network computed before the expansion.
   - A parameterization scheme (variance transfer) dynamically stabilizes weight, activation, and gradient scaling, so the network keeps its inference behavior as it grows.
3. **Optimizing Learning Rate Scheduling**:
   - Sub-networks introduced at different growth stages receive unequal amounts of training. A learning rate adaptation mechanism rebalances the gradient contribution of each sub-component.
   - Learning rates are adjusted per growth phase to compensate for the different training durations of these sub-networks.
4. **Broad Applicability and Acceleration Effects**:
   - The method applies to image classification as well as other tasks such as machine translation, and it performs well across various network architectures.
   - Experimental results show that the savings in computation translate into real wall-clock training speedups.

In summary, the paper tackles efficiency and optimization issues in large-scale model training with a network growing framework that trains faster while matching or exceeding the accuracy of a fixed-size model.
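The two mechanisms described above, function-preserving width growth and per-subnetwork learning rates, can be illustrated with a small sketch. This is not the paper's variance transfer scheme: as a simplified stand-in, it zero-initializes the outgoing weights of new hidden units, which keeps the network output unchanged but does not address weight or gradient scaling. The two-layer bias-free MLP, the function names (`grow_hidden`, `blockwise_sgd_step`), and the parameters `lr_old` and `lr_new` are all assumptions made for illustration.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(a, 0.0) for a in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def forward(w1, w2, x):
    """Two-layer MLP without biases: y = w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))


def grow_hidden(w1, w2, new_hidden, rng):
    """Widen the hidden layer to new_hidden units (hypothetical helper).

    New rows of w1 (incoming weights) are random. New columns of w2
    (outgoing weights) are zero, so the widened network computes the
    same function as before. This is a stand-in for, not an
    implementation of, the paper's variance transfer.
    """
    old_hidden, in_dim = len(w1), len(w1[0])
    extra = new_hidden - old_hidden
    if extra < 0:
        raise ValueError("new_hidden must be >= current hidden width")
    std = (2.0 / in_dim) ** 0.5  # He-style scale, an assumption
    new_rows = [[rng.gauss(0.0, std) for _ in range(in_dim)]
                for _ in range(extra)]
    w1_big = [row[:] for row in w1] + new_rows
    w2_big = [row[:] + [0.0] * extra for row in w2]
    return w1_big, w2_big


def blockwise_sgd_step(w1, grad_w1, old_hidden, lr_old, lr_new):
    """One SGD step on w1 with separate learning rates per subnetwork.

    Rows below old_hidden belong to the older subnetwork and use lr_old;
    rows added later use lr_new. How the two rates are chosen is left
    open here; the paper's actual rule is not reproduced.
    """
    updated = []
    for i, (row, g_row) in enumerate(zip(w1, grad_w1)):
        lr = lr_old if i < old_hidden else lr_new
        updated.append([w - lr * g for w, g in zip(row, g_row)])
    return updated


# Usage: grow a 4-unit hidden layer to 6 units and compare outputs.
rng = random.Random(0)
w1 = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
w2 = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)]
x = [0.5, -1.0, 2.0]
w1_big, w2_big = grow_hidden(w1, w2, 6, rng)
same = all(abs(a - b) < 1e-12
           for a, b in zip(forward(w1, w2, x), forward(w1_big, w2_big, x)))
print("output preserved after growth:", same)
```

Zero-initialized outgoing weights make the new units invisible to the output at the moment of growth, yet those weights still receive gradients and can be trained. Assigning the new rows their own learning rate is one simple way to treat subnetworks of different ages differently during optimization.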

Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation
