Abstract: The theory of greedy low-rank learning (GLRL) aims to explain the impressive generalization capabilities of deep learning. It proves that stochastic gradient-based training implicitly regularizes neural networks towards low-rank solutions through a gradual increase of the rank during training. However, there is a gap between theory and practice: GLRL requires an infinitesimal initialization of the weights, which is impractical because it is a saddle point. In this work, we remove the assumption of infinitesimal initialization by focusing on cumulative weight updates. We prove that the cumulative weight updates follow an incremental low-rank trajectory for arbitrary orthogonal initialization of the weights in a three-layer linear network. Empirically, we demonstrate that our theory holds on a broad range of neural networks (e.g., transformers) and standard training algorithms (e.g., SGD, Adam). However, existing training algorithms do not exploit this low-rank property to improve computational efficiency, because the networks are not parameterized in low-rank form. To remedy this, we design a new training algorithm, Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices while incrementally augmenting their ranks during training. We evaluate InRank on GPT-2, and our results indicate that InRank achieves prediction performance comparable to that of its full-rank counterpart while requiring at most 33% of the total ranks throughout training. We also propose an efficient version of InRank that reduces total training time by 37% and model size by 36% when training GPT-medium on WikiText-103 from scratch.
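To make the notion of a cumulative weight update concrete, the following is a minimal NumPy sketch, not the authors' implementation. It forms the update as the difference between the current and initial weights, estimates how many singular directions it occupies, and stores it as two low-rank factors. The function names and the spectral-mass `threshold` rule for picking the rank are assumptions made for illustration; the abstract does not specify how InRank decides when to augment the rank.

```python
import numpy as np


def cumulative_update(w_current, w_init):
    """Cumulative weight update: how far the weights have moved from initialization."""
    return w_current - w_init


def rank_for_spectral_mass(delta, threshold=0.9):
    """Smallest r whose top-r singular values hold `threshold` of the total.

    Hypothetical criterion, chosen only for illustration.
    """
    s = np.linalg.svd(delta, compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0  # no update yet, so no rank is needed
    cumulative = np.cumsum(s) / total
    # First index whose cumulative share reaches the threshold, as a count.
    r = int(np.searchsorted(cumulative, threshold)) + 1
    return min(r, len(s))


def low_rank_factors(delta, r):
    """Rank-r factors (U, V) such that U @ V is the best rank-r approximation of delta."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :r] * s[:r], vt[:r, :]


if __name__ == "__main__":
    # Toy example: the weights moved along only two directions.
    w_init = np.eye(4)
    w_current = w_init.copy()
    w_current[0, 0] += 3.0
    w_current[1, 1] += 1.0

    delta = cumulative_update(w_current, w_init)
    r = rank_for_spectral_mass(delta, threshold=0.9)
    u, v = low_rank_factors(delta, r)
    print("rank used:", r)
    print("reconstruction error:", np.linalg.norm(w_init + u @ v - w_current))
```

In this sketch the full weight is recovered as `w_init + u @ v`, so only the two factors need to be stored for the update. An incremental scheme would re-evaluate the rank as training proceeds and widen the factors when it grows.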

Low-Rank Learning by Design: the Role of Network Architecture and Activation Linearity in Gradient Rank Collapse

Neural Collapse versus Low-rank Bias: Is Deep Neural Collapse Really Optimal?

Neural Rank Collapse: Weight Decay and Small Within-Class Variability Yield Low-Rank Bias

The Persistence of Neural Collapse Despite Low-Rank Bias: An Analytic Perspective Through Unconstrained Features

Low Tensor Rank Learning of Neural Dynamics

Rank Diminishing in Deep Neural Networks

Dynamics in Deep Classifiers Trained with the Square Loss: Normalization, Low Rank, Neural Collapse, and Generalization Bounds

Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse

Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data

InRank: Incremental Low-Rank Learning

Effective Rank and the Staircase Phenomenon: New Insights into Neural Network Training Dynamics

Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks

Gradient dynamics for low-rank fine-tuning beyond kernels

How connectivity structure shapes rich and lazy learning in neural circuits

Lambda-Skip Connections: the architectural component that prevents Rank Collapse

An Unconstrained Layer-Peeled Perspective on Neural Collapse

Unifying Low Dimensional Observations in Deep Learning Through the Deep Linear Unconstrained Feature Model

Understanding Deep Learning via Notions of Rank

Towards Understanding Neural Collapse: The Effects of Batch Normalization and Weight Decay

On Generalization Bounds for Neural Networks with Low Rank Layers

Low Rank Optimization for Efficient Deep Learning: Making A Balance between Compact Architecture and Fast Training