Abstract: By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P, that extends $\mu$P, and show empirically that it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (such as in modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as a Megatron transformer trained on Common Crawl.

What problem does this paper attempt to address?

This paper studies how to scale the depth of deep neural networks without degrading performance or forcing hyperparameters to be retuned. It focuses on depthwise parametrizations of deep residual networks (ResNets), specifically how the block multiplier and the learning rate should scale with depth, and it classifies these parametrizations by their infinite-width-then-depth limits.

The authors propose Depth-µP, which extends the Maximal Update Parametrization (µP) from width to depth. For ResNets whose blocks each contain a single layer, Depth-µP is identified as the unique optimal parametrization: it maximizes both feature learning and feature diversity, and it empirically allows hyperparameters to transfer between networks of different depths. Concretely, Depth-µP keeps feature learning stable as depth grows by scaling each residual block's multiplier, and the learning rate, inversely with the square root of depth. The study also finds that, among all homogeneous nonlinearities, absolute value maximizes feature diversity, and that it yields significantly better performance in experiments.

When each block contains more than one layer, as in modern Transformers, every possible infinite-depth limit of these parametrizations has fundamental limitations. The paper shows this both theoretically and empirically, on simple networks and on a Megatron Transformer trained on Common Crawl. For such blocks, the interaction between weights becomes additive rather than multiplicative, which reduces feature diversity and hurts performance, so hyperparameter transfer methods for these architectures need to be reconsidered.

The paper also points out that, although increasing depth generally improves performance, going beyond a certain number of layers can degrade performance or require hyperparameters to be retuned. Depth-µP addresses this for single-layer blocks by giving one scaling strategy for both width and depth that preserves maximal feature learning and reduces the cost of hyperparameter tuning.
In conclusion, this paper addresses the problem of increasing network depth without sacrificing performance or training stability. It proposes a depthwise parametrization, Depth-µP, that achieves better feature learning and hyperparameter transfer across depth.
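The block-multiplier scaling described above can be illustrated with a minimal NumPy sketch of a residual network at initialization, where each block has one layer and the branch is multiplied by a / L^alpha. This is an illustration under stated assumptions and not the authors' implementation: the names `resnet_forward`, `centered_abs`, `a` and `alpha` are hypothetical, the weights are fresh Gaussian draws, and the absolute-value activation is centered by subtracting its mean across coordinates (an assumption made here so the activations stay zero-mean in this toy setting). The learning-rate scaling is not modeled.

```python
import numpy as np


def centered_abs(h):
    """Absolute-value nonlinearity followed by mean subtraction.

    The centering step is an assumption of this sketch, not taken from the text.
    """
    act = np.abs(h)
    return act - act.mean()


def resnet_forward(x, depth, a=1.0, alpha=0.5, seed=0):
    """Forward pass through `depth` one-layer residual blocks at initialization.

    Each block computes x <- x + (a / depth**alpha) * phi(W x), where W has
    i.i.d. N(0, 1/n) entries. alpha=0.5 gives the 1/sqrt(depth) multiplier;
    alpha=0.0 gives an unscaled branch for comparison.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    multiplier = a / depth**alpha
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
        x = x + multiplier * centered_abs(W @ x)
    return x


def rms(v):
    """Root-mean-square size of the coordinates of v."""
    return float(np.sqrt(np.mean(v**2)))


if __name__ == "__main__":
    x0 = np.random.default_rng(1).normal(size=256)
    for depth in (16, 64, 256):
        scaled = rms(resnet_forward(x0, depth, alpha=0.5)) / rms(x0)
        unscaled = rms(resnet_forward(x0, depth, alpha=0.0)) / rms(x0)
        print(f"depth={depth:4d}  1/sqrt(L) branch: {scaled:.2f}  unscaled branch: {unscaled:.3g}")
```

In this toy setting, the size of the activations relative to the input stays of order one as depth grows when the branch is scaled by 1/sqrt(L), while it grows rapidly with depth when the branch is left unscaled.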

Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks
