Abstract: Over the past few years, an extensively studied phenomenon in training deep networks is the implicit bias of gradient descent towards parsimonious solutions. In this work, we investigate this phenomenon by narrowing our focus to deep linear networks. Through our analysis, we reveal a surprising "law of parsimony" in the learning dynamics when the data possesses low-dimensional structures. Specifically, we show that the evolution of gradient descent starting from orthogonal initialization only affects a minimal portion of singular vector spaces across all weight matrices. In other words, the learning process happens only within a small invariant subspace of each weight matrix, despite the fact that all weight parameters are updated throughout training. This simplicity in learning dynamics could have significant implications for both efficient training and a better understanding of deep networks. First, the analysis enables us to considerably improve training efficiency by taking advantage of the low-dimensional structure in learning dynamics. We can construct smaller, equivalent deep linear networks without sacrificing the benefits associated with the wider counterparts. Second, it allows us to better understand deep representation learning by elucidating the linear progressive separation and concentration of representations from shallow to deep layers. We also conduct numerical experiments to support our theoretical results. The code for our experiments can be found at https://github.com/cjyaras/lawofparsimony.
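The invariance described in the abstract can be checked numerically on a toy problem. The sketch below is not the authors' code: it assumes square `d x d` layers, no weight decay, and a squared loss that reduces to `0.5 * ||W_L ... W_1 - Phi||_F^2` for a rank-`r` target `Phi` (as it would for whitened inputs). The names `train_dln`, `count_idle`, and all hyperparameters are hypothetical. Under these assumptions, working through the gradient formula suggests that at least `d - 2r` singular values of every layer stay equal to one shared scalar `rho`, which follows the recursion `rho <- rho * (1 - lr * rho^(2(L-1)))`; the code compares the trained weights against that prediction.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random d x d orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def chain(mats, d):
    """Return mats[-1] @ ... @ mats[0], or the identity for an empty list."""
    out = np.eye(d)
    for m in mats:
        out = m @ out
    return out

def train_dln(d=20, r=2, depth=3, eps=0.1, lr=0.02, steps=2000, seed=0):
    """Run plain GD on 0.5 * ||W_L ... W_1 - Phi||_F^2 from eps-scaled orthogonal init."""
    rng = np.random.default_rng(seed)
    # Rank-r target standing in for the low-dimensional data structure (assumption).
    U = random_orthogonal(d, rng)[:, :r]
    V = random_orthogonal(d, rng)[:, :r]
    phi = U @ np.diag(np.linspace(2.0, 1.0, r)) @ V.T
    Ws = [eps * random_orthogonal(d, rng) for _ in range(depth)]
    rho = eps  # predicted shared singular value on the untouched subspace
    for _ in range(steps):
        err = chain(Ws, d) - phi
        # Gradient w.r.t. layer l: (layers above)^T @ err @ (layers below)^T.
        grads = [chain(Ws[l + 1:], d).T @ err @ chain(Ws[:l], d).T
                 for l in range(depth)]
        Ws = [W - lr * g for W, g in zip(Ws, grads)]
        rho = rho * (1.0 - lr * rho ** (2 * (depth - 1)))
    return Ws, rho

def count_idle(W, rho, tol=1e-8):
    """Number of singular values of W that match the scalar prediction rho."""
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(np.abs(s - rho) < tol))

Ws, rho = train_dln()
for i, W in enumerate(Ws, start=1):
    print(f"layer {i}: {count_idle(W, rho)} of {W.shape[0]} singular values equal rho")
```

With `d = 20` and `r = 2`, the expectation under these assumptions is that at least 16 singular values per layer match `rho`, so only a `2r`-dimensional part of each weight matrix carries the learning.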

What problem does this paper attempt to address?

The problem this paper addresses is the implicit bias of gradient descent (GD) when training deep linear networks (DLNs). Specifically, the authors study how GD tends toward parsimonious solutions when the data has low-dimensional structure. The paper reveals a "law of parsimony": the learning dynamics of GD affect only a minimal portion of the singular vector spaces of each weight matrix, while the remaining singular subspaces stay unchanged throughout training. This finding matters both for understanding how deep networks learn and for improving training efficiency.

### Main Contributions

1. **Constructing Smaller but Equivalent Networks**: Because learning occurs only within a small invariant subspace of each weight matrix, significantly smaller DLNs can be constructed. These smaller networks share the learning dynamics of the larger original networks, which greatly reduces computational cost without sacrificing the benefits of wide networks.
2. **New Theoretical Insights into Deep Representation Learning**: Using the law of parsimony, the paper explains the linear progressive separation and concentration of representations from shallow to deep layers, which helps clarify feature learning in multi-class classification problems.

### Research Background

In recent years, deep learning has achieved remarkable success in many engineering and scientific applications. Research shows that its effectiveness is partly attributable to the implicit bias of its learning dynamics, which favors solutions that generalize well and do not overfit. In particular, gradient descent tends to learn simple functions and, when training linear networks for binary classification, tends toward the maximum-margin solution. Gradient descent also shows a bias towards low-rank solutions.

### Main Results of the Paper

- **Law of Parsimony**: When the cross-correlation matrix of the training data has low-dimensional structure, the learning dynamics of gradient descent tend toward parsimonious solutions. Specifically, gradient descent starting from orthogonal initialization affects only a small invariant subspace of each weight matrix, while the remaining singular subspaces stay unchanged throughout training.
- **Low-Rank Implicit Bias**: When the initialization scale is small, the gradient descent trajectory shows an implicit bias towards low-rank solutions, which explains why deeper networks tend toward low-rank solutions during training.

### Application Examples

1. **Accelerating Deep Low-Rank Matrix Completion**: Using the law of parsimony, optimization can be accelerated by constructing an equivalent but significantly smaller network, which substantially improves training efficiency.
2. **Understanding Progressive Feature Collapse**: In multi-class classification problems, the paper shows that features progressively collapse from shallow to deep layers and provides a theoretical explanation.

### Conclusion

By revealing the law of parsimony of gradient descent when training deep linear networks, the paper provides both a new perspective on the learning mechanism of deep networks and practical methods for improving training efficiency. These findings can inform the design and training of deep learning models.
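The idea of a smaller but equivalent network can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: square `d x d` layers, no weight decay, a loss of the form `0.5 * ||W_L ... W_1 - Phi||_F^2` with a rank-`r` target `Phi`, and hypothetical names throughout (`gradient_descent`, `compare_full_and_compressed`, `V_act`, `V_idle`). It trains the full network and a network of `2r x 2r` factors with the same step size, rebuilds the full end-to-end matrix from the small one, and reports the largest entrywise gap.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random d x d orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def chain(mats, d):
    """Return mats[-1] @ ... @ mats[0], or the identity for an empty list."""
    out = np.eye(d)
    for m in mats:
        out = m @ out
    return out

def gradient_descent(Ws, target, lr, steps):
    """Plain GD on 0.5 * ||W_L ... W_1 - target||_F^2."""
    d = target.shape[0]
    for _ in range(steps):
        err = chain(Ws, d) - target
        grads = [chain(Ws[l + 1:], d).T @ err @ chain(Ws[:l], d).T
                 for l in range(len(Ws))]
        Ws = [W - lr * g for W, g in zip(Ws, grads)]
    return Ws

def compare_full_and_compressed(d=20, r=2, depth=3, eps=0.1, lr=0.02,
                                steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    U = random_orthogonal(d, rng)[:, :r]
    V = random_orthogonal(d, rng)[:, :r]
    phi = U @ np.diag(np.linspace(2.0, 1.0, r)) @ V.T  # rank-r target (assumption)
    Qs = [random_orthogonal(d, rng) for _ in range(depth)]

    # Full network: d x d factors, eps-scaled orthogonal initialization.
    full = gradient_descent([eps * Q for Q in Qs], phi, lr, steps)

    # Input directions v with phi @ v = 0 and phi.T @ Q_total @ v = 0 are treated
    # as idle; the remaining 2r directions span the active subspace.
    Q_total = chain(Qs, d)
    _, _, Vt = np.linalg.svd(np.vstack([phi, phi.T @ Q_total]))
    V_act, V_idle = Vt[:2 * r].T, Vt[2 * r:].T
    U_act, U_idle = Q_total @ V_act, Q_total @ V_idle

    # Compressed network: 2r x 2r factors started at eps * I, fitted to the
    # target expressed in the active coordinates.
    small_target = U_act.T @ phi @ V_act
    small = gradient_descent([eps * np.eye(2 * r) for _ in range(depth)],
                             small_target, lr, steps)

    # Scalar recursion for the shared singular value on the idle subspace.
    rho = eps
    for _ in range(steps):
        rho *= 1.0 - lr * rho ** (2 * (depth - 1))

    rebuilt = U_act @ chain(small, 2 * r) @ V_act.T + rho ** depth * U_idle @ V_idle.T
    return float(np.max(np.abs(rebuilt - chain(full, d))))

print("max |rebuilt - full| =", compare_full_and_compressed())
```

In this toy setting each gradient step on the small network costs matrix products of size `2r` rather than `d`, which is where the efficiency gain would come from; the reported gap should sit near floating-point rounding error if the two trajectories really coincide.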

The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks
