Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

Greg Yang, Dingli Yu, Chen Zhu, Soufiane Hayou
2023-10-13
Abstract: By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P that extends $\mu$P and show empirically it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (such as modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as Megatron transformer trained on Common Crawl.
Neural and Evolutionary Computing, Disordered Systems and Neural Networks, Probability
What problem does this paper attempt to address?
This paper studies how to scale the depth of deep residual networks (ResNets) without losing training stability or having to retune hyperparameters. It focuses on depthwise parametrizations, that is, how the block multiplier (the scale factor applied to each residual branch) and the learning rate should change with depth, and it classifies them by their infinite-width-then-depth limits. The authors propose Depth-µP, which extends the Maximal Update Parametrization (µP) from width to depth. For ResNets in which each block has a single layer, Depth-µP is identified as the unique optimal parametrization: it maximizes both feature learning and feature diversity, and empirically it allows hyperparameters to transfer between networks of different depths. In infinite-width ResNets, Depth-µP keeps feature learning stable by scaling both the multiplier of each residual block and the learning rate inversely with the square root of depth. The study also finds that, among homogeneous nonlinearities, the absolute value maximizes feature diversity and leads to significantly better empirical performance.
The paper also points out that, while increasing depth generally improves performance, going beyond a certain number of layers may degrade performance or require hyperparameters to be retuned. Depth-µP addresses this by providing a joint scaling strategy for width and depth that preserves maximal feature learning and reduces the cost of hyperparameter tuning.
However, when each block contains more than one layer (as in modern Transformers), every possible infinite-depth limit of such parametrizations has fundamental limitations, which the paper demonstrates both theoretically and empirically. For blocks of depth 2 or more, the interaction between the weights within a block becomes additive rather than multiplicative, which reduces feature diversity and hurts performance. Current hyperparameter transfer methods therefore need to be reconsidered for such architectures.
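
The scaling rule described above can be illustrated with a small sketch. It is not the authors' code: the helper names `depth_mup_scaling` and `output_mean_square`, the base values, the width of 32, and the toy network are all assumptions made for illustration. The sketch applies the 1/sqrt(depth) factor to both the block multiplier and the learning rate, as the summary states, and then compares how the output scale of a randomly initialized ResNet with one-layer blocks and absolute-value nonlinearity behaves with and without the scaled multiplier.

```python
import math
import random


def depth_mup_scaling(depth, base_multiplier=1.0, base_lr=0.1):
    """Hypothetical helper: scale the block multiplier and the learning
    rate by 1/sqrt(depth), following the summary's description."""
    scale = 1.0 / math.sqrt(depth)
    return base_multiplier * scale, base_lr * scale


def output_mean_square(depth, branch_multiplier, width=32, seed=0):
    """Forward pass of a toy resnet with one-layer blocks at random init.

    Each block computes x <- x + branch_multiplier * W |x|, where W has
    i.i.d. N(0, 1/width) entries (an assumed initialization). Returns the
    mean squared coordinate of the final activation vector.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = 1.0 / math.sqrt(width)
    for _ in range(depth):
        h = [abs(v) for v in x]  # absolute-value nonlinearity
        # One fresh random weight row per output coordinate.
        branch = [sum(rng.gauss(0.0, std) * hj for hj in h) for _ in range(width)]
        x = [xi + branch_multiplier * bi for xi, bi in zip(x, branch)]
    return sum(v * v for v in x) / width


if __name__ == "__main__":
    for depth in (4, 16, 64):
        mult, lr = depth_mup_scaling(depth)
        scaled = output_mean_square(depth, mult)
        unscaled = output_mean_square(depth, 1.0)
        print(f"depth={depth:3d} multiplier={mult:.3f} lr={lr:.4f} "
              f"scaled={scaled:.2f} unscaled={unscaled:.3e}")
```

In this toy setting, a multiplier of 1 roughly doubles the mean square of the activations at every block, so the output scale grows exponentially with depth, while the 1/sqrt(depth) multiplier keeps it of order one at any depth. The sketch covers only the forward pass at initialization; it does not train the network or demonstrate hyperparameter transfer.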
In conclusion, this paper addresses the problem of increasing network depth without sacrificing performance or training stability. It proposes a depthwise parametrization, Depth-µP, that achieves maximal feature learning and hyperparameter transfer across depths for ResNets with one-layer blocks, and it identifies fundamental limitations when each block is deeper.