Abstract: The success of deep neural networks in real-world problems has prompted many attempts to explain their training dynamics and generalization performance, but more guiding principles for training neural networks are still needed. Motivated by the edge of chaos principle behind the optimal performance of neural networks, we study the role of various hyperparameters in modern neural network training algorithms in terms of the order-chaos phase diagram. In particular, we study a fully analytical feedforward neural network trained on the widely adopted Fashion-MNIST dataset, and examine the dynamics associated with the back-propagation hyperparameters during training. We find that, for the basic algorithm of stochastic gradient descent with momentum and in the range around commonly used hyperparameter values, clear scaling relations with respect to training time are present in the ordered phase of the phase diagram, and the model's optimal generalization power at the edge of chaos is similar across different training parameter combinations. In the chaotic phase, the same scaling no longer holds. The scaling allows us to choose training parameters that achieve faster training without sacrificing performance. In addition, we find that weight decay, a commonly used model regularization method, effectively pushes the model towards the ordered phase to achieve better performance. Leveraging this fact and the scaling relations in the other hyperparameters, we derive a principled guideline for hyperparameter determination, such that the model can achieve optimal performance by saturating it at the edge of chaos. Demonstrated on this simple neural network model and training algorithm, our work improves the understanding of neural network training dynamics and can potentially be extended to guiding principles for more complex model architectures and algorithms.
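
For readers unfamiliar with the hyperparameters named in the abstract (learning rate, momentum, weight decay), the following is a minimal sketch of one common textbook form of the stochastic gradient descent with momentum update, with weight decay applied as an L2 term added to the gradient. The abstract does not state the exact convention used in the paper, so the function name `sgd_momentum_step`, the default values, and the update form itself are assumptions made for illustration; libraries differ in how they scale the velocity term.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=0.0):
    """One SGD-with-momentum update on a flat list of parameters.

    Assumed (classical) form, not necessarily the paper's convention:
        g_total = g + weight_decay * w
        v_new   = momentum * v - lr * g_total
        w_new   = w + v_new
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + weight_decay * w        # L2 weight decay folded into the gradient
        v_new = momentum * v - lr * g_total   # velocity accumulates past gradients
        new_velocity.append(v_new)
        new_weights.append(w + v_new)
    return new_weights, new_velocity


# Toy usage: minimize f(w) = 0.5 * w^2, whose gradient is simply w.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, grads=w, velocity=v,
                             lr=0.1, momentum=0.9, weight_decay=1e-4)
print("final weight:", w[0])
```

In this form the three hyperparameters enter the update together, which is why the abstract treats them jointly when locating the model in the order-chaos phase diagram. The toy quadratic above only illustrates the mechanics of the update and does not reproduce any of the scaling relations reported in the paper.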

Universal Scaling Laws of Absorbing Phase Transitions in Artificial Deep Neural Networks

Extracting Critical Exponents by Finite-Size Scaling with Convolutional Neural Networks

Low-dimensional Intrinsic Dimension Reveals a Phase Transition in Gradient-Based Learning of Deep Neural Networks

Edge of chaos as a guiding principle for modern neural network training

Criticality & Deep Learning I: Generally Weighted Nets

Criticality versus uniformity in deep neural networks

Universal scaling behavior of non-equilibrium phase transitions

Learning in PINNs: Phase transition, total diffusion, and generalization

Charting the Topography of the Neural Network Landscape with Thermal-Like Noise

Unveiling the intrinsic dynamics of biological and artificial neural networks: from criticality to optimal representations

Speed Limits for Deep Learning

Scaling ResNets in the Large-depth Regime

Neural Scaling Laws Rooted in the Data Distribution

Critical feature learning in deep neural networks

Phase Diagram for Two-layer ReLU Neural Networks at Infinite-width Limit

Quasi-universal scaling in mouse-brain neuronal activity stems from edge-of-instability critical dynamics

Order and Chaos: NTK views on DNN Normalization, Checkerboard and Boundary Artifacts

Universality Class of Machine Learning for Critical Phenomena

Scaling description of generalization with number of parameters in deep learning

Understanding Artificial Neural Network's Behavior from Neuron Activation Perspective

A Dynamical Model of Neural Scaling Laws