Abstract: Large artificial neural networks have become a mainstay of language, vision, and audio processing and synthesis, yet their initializations and learning rates are often set in an unsophisticated fashion, due to the high cost of hyperparameter sweeps at scale. The $\mu$-Parameterization ($\mu$P) offers a potential solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models of up to 10B parameters and training budgets of up to 190B tokens, we find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not.
What problem does this paper attempt to address?
This paper empirically evaluates μ-Transfer, a technique for setting the initialization and learning rates of large neural networks. With μ-Transfer, hyperparameters are tuned on a small model and then transferred to a large one, which addresses the problem that large models are often initialized and tuned in an unsophisticated way because hyperparameter sweeps at scale are expensive. Although μ-Transfer has the potential to provide stable training and near-optimal hyperparameters at low cost, its effectiveness in practice, its compatibility with common techniques, and its performance on very large models still need to be verified.
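To make the idea of transferring a learning rate from a small model to a large one concrete, here is a minimal sketch of one commonly described μP formulation for Adam-style optimizers. It is not taken from the paper: the function name `mup_adam_hparams`, the parameter-group names, and the choice of scaling rule are all illustrative assumptions, and published μP variants differ in their details.

```python
import math


def mup_adam_hparams(width, base_width, base_lr):
    """Return per-group init std and Adam learning rate for a given width.

    Assumed (illustrative) rules, following one common muP variant:
      - embedding:   init std 1,            lr = base_lr
      - hidden:      init std 1/sqrt(width), lr = base_lr * base_width / width
      - unembedding: init std 1/width,       lr = base_lr * base_width / width
    `base_lr` is the learning rate tuned on the small model of width `base_width`.
    """
    if width <= 0 or base_width <= 0:
        raise ValueError("width and base_width must be positive")
    ratio = base_width / width  # shrinks the lr as the model gets wider
    return {
        "embedding": {"init_std": 1.0, "lr": base_lr},
        "hidden": {"init_std": 1.0 / math.sqrt(width), "lr": base_lr * ratio},
        "unembedding": {"init_std": 1.0 / width, "lr": base_lr * ratio},
    }


# Tune base_lr on a small proxy model, then reuse it for a wider model.
small = mup_adam_hparams(width=128, base_width=128, base_lr=0.01)
large = mup_adam_hparams(width=512, base_width=128, base_lr=0.01)
print(small["hidden"], large["hidden"])
```

Under these assumed rules, only `base_lr` is swept on the small model; the wider model reuses it, with the hidden and unembedding learning rates reduced in proportion to the width ratio.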
The study focuses on the Transformer architecture. Through extensive ablations and large-scale experiments (models of up to 10 billion parameters and training budgets of up to 190 billion tokens), the authors find that μ-Transfer works as intended in most important cases but may fail in a few specific ones. They also examine how μ-Transfer interacts with techniques commonly used in practice, such as decoupled weight decay and multiplicative nonlinearities. In some settings, such as using trainable gains or an excessively large attention scale, μ-Transfer may not transfer the optimal learning rate effectively.
In addition, the paper points out that although μ-Transfer performs well when transferring from small models to large ones, the optimal learning rate may drift slightly as model size increases. Finally, the authors suggest omitting certain components (such as trainable gains) to preserve the effectiveness of μ-Transfer when training large models, and note that transferring hyperparameters across model sizes with μ-Transfer benefits the optimization process.
In summary, this paper presents an in-depth empirical study of μ-Transfer, showing where it applies and where it breaks down in large-scale Transformer models, and offers insights for optimizing and simplifying the training of large neural networks.