Abstract: Over the past few years, an extensively studied phenomenon in training deep networks is the implicit bias of gradient descent towards parsimonious solutions. In this work, we investigate this phenomenon by narrowing our focus to deep linear networks. Through our analysis, we reveal a surprising "law of parsimony" in the learning dynamics when the data possesses low-dimensional structures. Specifically, we show that the evolution of gradient descent starting from orthogonal initialization only affects a minimal portion of singular vector spaces across all weight matrices. In other words, the learning process happens only within a small invariant subspace of each weight matrix, despite the fact that all weight parameters are updated throughout training. This simplicity in learning dynamics could have significant implications for both efficient training and a better understanding of deep networks. First, the analysis enables us to considerably improve training efficiency by taking advantage of the low-dimensional structure in learning dynamics. We can construct smaller, equivalent deep linear networks without sacrificing the benefits associated with the wider counterparts. Second, it allows us to better understand deep representation learning by elucidating the linear progressive separation and concentration of representations from shallow to deep layers. We also conduct numerical experiments to support our theoretical results. The code for our experiments can be found at https://github.com/cjyaras/lawofparsimony.
What problem does this paper attempt to address?
The paper studies the implicit bias of Gradient Descent (GD) when training Deep Linear Networks (DLNs). Specifically, the authors ask how Gradient Descent is driven toward parsimonious solutions when the data has low-dimensional structure. Their answer is a "law of parsimony": starting from orthogonal initialization, the learning dynamics of Gradient Descent affect only a minimal portion of the singular vector spaces of each weight matrix, while the remaining singular subspaces stay unchanged throughout training. The finding bears on two questions: how deep networks learn, and how they can be trained more efficiently.
### Main Contributions
1. **Constructing Smaller but Equivalent Networks**: Because learning occurs only within a small invariant subspace of each weight matrix, significantly smaller DLNs can be constructed. These small networks share the same learning dynamics as the original, wider networks, which reduces computational cost without sacrificing the benefits of width.
2. **New Theoretical Insights into Deep Representation Learning**: Using the "law of parsimony", the paper explains the linear progressive separation and concentration of representations from shallow to deep layers, which helps clarify how features are learned in multi-class classification problems.
### Research Background
In recent years, deep learning has achieved remarkable success in many engineering and scientific applications. Prior research attributes part of this effectiveness to an implicit bias in the learning dynamics: among the many solutions that fit the training data, training tends to select ones that generalize well rather than overfit. In particular, Gradient Descent tends to learn simple functions and, when training linear networks for binary classification, converges toward the maximum-margin solution. Gradient Descent also shows a bias towards low-rank solutions.
### Main Results of the Paper
- **Law of Parsimony**: When the cross-correlation matrix of the training data has low-dimensional structure, the learning dynamics of Gradient Descent favor parsimonious solutions. Specifically, Gradient Descent starting from orthogonal initialization affects only a small invariant subspace of each weight matrix, while the remaining singular subspaces stay unchanged throughout training.
- **Low-Rank Implicit Bias**: When the initialization scale is small, the Gradient Descent trajectory shows an implicit bias towards low-rank solutions, which helps explain why deeper networks tend toward low-rank solutions during training.
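
The law of parsimony can be checked numerically in a toy setting. The sketch below is an illustration written for this summary, not the authors' code: it assumes square weights of width `d`, whitened inputs so that the loss reduces to `0.5 * ||W_L ... W_1 - Phi||_F^2`, a rank-`r` target `Phi`, no weight decay, and initialization `W_l(0) = eps * Q_l` with `Q_l` orthogonal. Under these assumptions, at least `d - 2r` singular values of every layer should stay equal to a single scalar `rho(t)` obeying `rho(t+1) = rho(t) - lr * rho(t)^(2L-1)`. All function names and hyperparameters are hypothetical choices.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def low_rank_target(d, r, rng):
    """Rank-r d x d target with singular values between 2 and 1 (illustrative choice)."""
    u = random_orthogonal(d, rng)[:, :r]
    v = random_orthogonal(d, rng)[:, :r]
    return u @ np.diag(np.linspace(2.0, 1.0, r)) @ v.T


def train_dln(phi, depth=3, eps=0.5, lr=0.05, steps=1000, seed=0):
    """Gradient descent on 0.5 * ||W_L ... W_1 - phi||_F^2 from scaled orthogonal init."""
    rng = np.random.default_rng(seed)
    d = phi.shape[0]
    weights = [eps * random_orthogonal(d, rng) for _ in range(depth)]  # [W_1, ..., W_L]
    losses = []
    for _ in range(steps):
        # prefix[k] = W_k ... W_1, with prefix[0] = I
        prefix = [np.eye(d)]
        for w in weights:
            prefix.append(w @ prefix[-1])
        # suffix[k] = W_L ... W_{k+1}, with suffix[depth] = I
        suffix = [np.eye(d)]
        for w in reversed(weights):
            suffix.append(suffix[-1] @ w)
        suffix = suffix[::-1]
        residual = prefix[-1] - phi
        losses.append(0.5 * np.sum(residual ** 2))
        # Gradient with respect to weights[l], i.e. W_{l+1} in 1-based notation
        grads = [suffix[l + 1].T @ residual @ prefix[l].T for l in range(depth)]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights, losses


def shared_singular_value(eps, depth, lr, steps):
    """Scalar recursion followed by the singular values that the data never touches."""
    rho = eps
    for _ in range(steps):
        rho -= lr * rho ** (2 * depth - 1)
    return rho


d, r, depth, eps, lr, steps = 20, 2, 3, 0.5, 0.05, 1000
phi = low_rank_target(d, r, np.random.default_rng(1))
weights, losses = train_dln(phi, depth, eps, lr, steps, seed=0)
rho = shared_singular_value(eps, depth, lr, steps)
for i, w in enumerate(weights, start=1):
    s = np.linalg.svd(w, compute_uv=False)
    n_shared = int(np.sum(np.abs(s - rho) < 1e-6))
    print(f"layer {i}: {n_shared} of {d} singular values match rho = {rho:.4f}")
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Every entry of every weight matrix changes during training, yet the count printed for each layer should be at least `d - 2r`: only a subspace of dimension at most `2r` carries any data-dependent learning.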
### Application Examples
1. **Accelerating Deep Low-Rank Matrix Completion**: Using the law of parsimony, the optimization can be carried out on an equivalent but significantly smaller network, which improves training efficiency.
2. **Understanding Progressive Feature Collapse**: In multi-class classification problems, the paper shows that features collapse progressively from shallow to deep layers and provides a theoretical explanation for this behavior.
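
The idea of training an equivalent but smaller network can be sketched in a simpler, fully observed setting (not matrix completion itself). The code below is an assumption-laden illustration rather than the paper's implementation: it assumes square width-`d` layers, the loss `0.5 * ||W_L ... W_1 - Phi||_F^2` with a rank-`r` target, no weight decay, and initialization `eps * Q_l` with orthogonal `Q_l`. The basis construction through the null space of the stacked matrix `[Phi; Phi^T Q_L ... Q_1]` and all names are choices made for this example. It trains a width-`2r` network alongside the full one and compares the end-to-end maps.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def chain(weights, n):
    """Product W_k ... W_1 of a list [W_1, ..., W_k]; identity for an empty list."""
    prod = np.eye(n)
    for w in weights:
        prod = w @ prod
    return prod


def gradient_descent(weights, target, lr, steps):
    """Plain GD on 0.5 * ||W_L ... W_1 - target||_F^2 for square weight matrices."""
    n = target.shape[0]
    for _ in range(steps):
        residual = chain(weights, n) - target
        grads = [
            chain(weights[l + 1:], n).T @ residual @ chain(weights[:l], n).T
            for l in range(len(weights))
        ]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


def compare_full_and_compressed(d=20, r=2, depth=3, eps=0.5, lr=0.05, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    u = random_orthogonal(d, rng)[:, :r]
    v = random_orthogonal(d, rng)[:, :r]
    phi = u @ np.diag(np.linspace(2.0, 1.0, r)) @ v.T  # rank-r target (illustrative)
    qs = [random_orthogonal(d, rng) for _ in range(depth)]
    q_prod = chain(qs, d)

    # Full network: depth layers of size d x d.
    full = gradient_descent([eps * q for q in qs], phi, lr, steps)

    # Input directions x with phi @ x = 0 and phi.T @ q_prod @ x = 0 never see the data.
    # v2 spans them (null space, assuming the stacked matrix has rank exactly 2r).
    stacked = np.vstack([phi, phi.T @ q_prod])
    _, _, vt = np.linalg.svd(stacked)
    v1, v2 = vt[: 2 * r].T, vt[2 * r:].T
    u1, u2 = q_prod @ v1, q_prod @ v2

    # Compressed network: depth layers of size 2r x 2r, started at eps * I.
    small = gradient_descent([eps * np.eye(2 * r)] * depth, u1.T @ phi @ v1, lr, steps)

    # The untouched directions follow a scalar recursion shared by all layers.
    rho = eps
    for _ in range(steps):
        rho -= lr * rho ** (2 * depth - 1)

    rebuilt = u1 @ chain(small, 2 * r) @ v1.T + rho ** depth * (u2 @ v2.T)
    return float(np.max(np.abs(chain(full, d) - rebuilt))), small[0].shape


err, small_shape = compare_full_and_compressed()
print(f"compressed layer shape: {small_shape}, max deviation from full network: {err:.2e}")
```

In this sketch the compressed network updates `depth * (2r)^2` parameters per step instead of `depth * d^2`, and the deviation from the full network's end-to-end map should sit at the level of floating-point error.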
### Conclusion
By revealing the "law of parsimony" of Gradient Descent in training deep linear networks, the paper offers both a new perspective on how deep networks learn and a practical route to more efficient training through smaller, equivalent networks. These findings can inform the design and training of deep learning models.