Abstract: Large artificial neural networks have become a mainstay of language, vision, and audio processing and synthesis, yet their initializations and learning rates are often set in an unsophisticated fashion, due to the high cost of hyperparameter sweeps at scale. The $\mu$-Parameterization ($\mu$P) offers a potential solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models of up to 10B parameters and training budgets of up to 190B tokens, we find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not.

What problem does this paper attempt to address?

This paper examines how well μ-Transfer works for setting initializations and learning rates in large neural networks. μ-Transfer is a technique for tuning hyperparameters on a small model and transferring them to a large one. It targets the high cost of hyperparameter sweeps at scale, which otherwise leaves large models with poorly tuned initializations and learning rates. Although μ-Transfer promises stable training and low-cost, near-optimal hyperparameters, its practical effectiveness, its compatibility with common techniques, and its behavior on very large models still need to be verified.

The study focuses on the Transformer architecture. Through extensive ablations and large-scale experiments (up to 10 billion parameters and 190 billion training tokens), the authors find that μ-Transfer works as intended in most important cases but may fail in a few specific ones. They also test its compatibility with techniques commonly used in practice, such as decoupled weight decay and multiplicative nonlinearities, and find that in some settings, such as trainable gains or an excessively large attention scale, μ-Transfer may not transfer the optimal learning rate reliably. In addition, the paper notes that although transfer from small to large models works well overall, the optimal learning rate may drift slightly as model size increases. Finally, the authors suggest omitting certain components (such as trainable gains) to preserve the effectiveness of μ-Transfer when training large models, and point out that hyperparameter transfer across model sizes benefits the optimization process.

In summary, this paper presents an in-depth empirical study of μ-Transfer, showing where it applies and where it breaks down in large-scale Transformer models, and offers useful guidance for optimizing and simplifying the training of large neural networks.
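To make the idea of transferring a learning rate from a small proxy model to a wider target concrete, here is a minimal sketch. It assumes one common formulation of μP for Adam-style optimizers: the embedding keeps the tuned learning rate, while hidden and output layers scale it by proxy width over target width. The function name, the group names, and the exact initialization rules are illustrative assumptions, not details taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    init_std: float  # standard deviation for weight initialization
    lr: float        # learning rate for this parameter group


def mu_transfer(base_lr: float, proxy_width: int, target_width: int) -> dict:
    """Hypothetical helper: scale settings tuned at proxy_width to target_width.

    Assumed rules (one common muP variant for Adam-style optimizers):
      - embedding: constant init std, learning rate unchanged
      - hidden:    init std 1/sqrt(width), learning rate scaled by proxy/target
      - output:    init std 1/width,      learning rate scaled by proxy/target
    """
    ratio = proxy_width / target_width
    return {
        "embedding": ParamGroup(init_std=1.0, lr=base_lr),
        "hidden": ParamGroup(init_std=target_width ** -0.5, lr=base_lr * ratio),
        "unembedding": ParamGroup(init_std=1.0 / target_width, lr=base_lr * ratio),
    }


if __name__ == "__main__":
    # Learning rate tuned on a width-128 proxy, reused for a width-1024 model.
    for name, group in mu_transfer(1e-2, 128, 1024).items():
        print(f"{name}: init_std={group.init_std:.5f}, lr={group.lr:.6f}")
```

Under these assumptions, only the proxy model needs a learning-rate sweep; the wider model reuses the tuned value with per-group rescaling. Whether the rescaled value is actually optimal at the larger width is the empirical question the paper investigates.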

A Large-Scale Exploration of $μ$-Transfer

u-$μ$P: The Unit-Scaled Maximal Update Parametrization

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

ScaLearn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale

$μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Super Consistency of Neural Network Landscapes and Learning Rate Transfer

Approximating Two-Layer Feedforward Networks for Efficient Transformers

Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit

Warmstarting for Scaling Language Models

$π$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Transferable Neural Processes for Hyperparameter Optimization

Massive Exploration of Neural Machine Translation Architectures

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

Parameter-efficient Feature-Based Transfer for Paraphrase Identification

Enhancing Scalability of Pre-trained Language Models Via Efficient Parameter Sharing

Towards a Unified View of Parameter-Efficient Transfer Learning

Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization