Abstract: This paper introduces a visual sandbox designed to explore the training dynamics of a small-scale transformer model, with the embedding dimension constrained to $d=2$. This restriction allows for a comprehensive two-dimensional visualization of each layer's dynamics. Through this approach, we gain insights into training dynamics, circuit transferability, and the causes of loss spikes, including those induced by the high curvature of normalization layers. We propose strategies to mitigate these spikes, demonstrating how good visualization facilitates the design of innovative ideas of practical interest. Additionally, we believe our sandbox could assist theoreticians in assessing essential training dynamics mechanisms and integrating them into future theories. The code is available at https://github.com/facebookresearch/pal.

What problem does this paper attempt to address?

The paper aims to build an in-depth understanding of the training dynamics of neural networks, Transformer models in particular. To do so, the authors construct a visual "sandbox" for exploring the internal mechanisms and behavioral changes of small-scale Transformer models during training. Specifically, the paper addresses the following problems:

1. **Visualization of Training Dynamics**:
   - Constraining the embedding dimension to $d = 2$ allows the training dynamics of each layer to be fully visualized in a two-dimensional plane.
   - This gives detailed observations of how each layer behaves during training, including the typical two-stage learning process of representation learning followed by classifier fitting.
2. **Circuit Transferability**:
   - Demonstrate that circuits transfer between different tasks, which underlines the importance of curriculum learning and data curation.
   - Explore how circuits learned on one task can be applied to other tasks, thereby improving the model's generalization ability.
3. **Loss Spikes and Their Mitigation Strategies**:
   - Investigate the causes of loss spikes, including those induced by the high curvature of normalization layers, and propose possible mitigation strategies.
   - Analyze the impact of these spikes on training stability and propose improvements that lead to a more stable training process.
4. **Bridging Theory and Practice**:
   - Give theoretical researchers a tool for understanding key training dynamics mechanisms intuitively, which helps in developing new theories.
   - Give practitioners a platform for testing new ideas, such as optimizer modifications, architecture adjustments, training settings, and data processing.
5. **Resource Efficiency and Environmental Impact**:
   - Studying training dynamics on small-scale models reduces the computational resources and carbon emissions that would otherwise be spent on training large-scale models.
   - Provide an effective way to analyze and optimize model performance without incurring the high cost of large-scale models.

In summary, the paper aims to reveal internal mechanisms by visualizing and analyzing the training dynamics of small-scale Transformer models, and to provide insights that are useful for both theoretical research and practical applications. This helps improve model performance while reducing training costs and environmental burden.
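To make the point about normalization layers more concrete, the following is a minimal sketch of why a normalization layer can be very sensitive near the origin when $d = 2$. It is not taken from the paper or its repository. It assumes an RMS-style normalization without learnable gain or epsilon, and the helper names (`rms_norm`, `jacobian_fd`, `sensitivity`) are hypothetical. For this assumed layer, the Jacobian is $\frac{\sqrt{d}}{\lVert x \rVert}\left(I - \frac{x x^\top}{\lVert x \rVert^2}\right)$, so its size grows like $1/\lVert x \rVert$ as the input approaches zero.

```python
import math


def rms_norm(x, eps=0.0):
    """Rescale x to root-mean-square 1 (assumed layer: no gain, no bias)."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [v / rms for v in x]


def jacobian_fd(f, x, h):
    """Central finite-difference Jacobian of f at x, returned as rows."""
    d = len(x)
    cols = []
    for j in range(d):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        # Column j holds the derivative of every output w.r.t. x[j].
        cols.append([(a - b) / (2 * h) for a, b in zip(f_plus, f_minus)])
    # Transpose so that rows index outputs and columns index inputs.
    return [[cols[j][i] for j in range(d)] for i in range(d)]


def frobenius(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def sensitivity(radius, angle=0.3):
    """Frobenius norm of the rms_norm Jacobian at a 2D point of given radius."""
    x = [radius * math.cos(angle), radius * math.sin(angle)]
    # Step size scales with the radius so the estimate stays accurate near 0.
    return frobenius(jacobian_fd(rms_norm, x, h=radius * 1e-4))


if __name__ == "__main__":
    for r in (1.0, 0.1, 0.01, 0.001):
        print(f"radius={r:<6} jacobian_norm={sensitivity(r):.3f}")
```

Under these assumptions the printed norm grows roughly as $\sqrt{2}/r$, so a small parameter update that pushes an activation close to the origin can produce a very large change in the normalized output. This is one plausible reading of how normalization layers could contribute to loss spikes; the paper's own analysis and mitigation strategies may differ.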

A Visual Case Study of the Training Dynamics in Neural Networks
