Abstract: Deep neural networks have been used in various machine learning applications and have achieved tremendous empirical success. However, training deep neural networks is a challenging task, and many alternatives to end-to-end back-propagation have been proposed. Layer-wise training is one of them: it trains a single layer at a time rather than all layers simultaneously. In this paper, we study layer-wise training using block coordinate gradient descent (BCGD) for deep linear networks. We establish a general convergence analysis of BCGD and find the optimal learning rate, which results in the fastest decrease in the loss. More importantly, the optimal learning rate can be applied directly in practice, as it does not require any prior knowledge; thus, no tuning of the learning rate is needed. We also identify the effects of depth, width, and initialization on the training process. We show that when the orthogonal-like initialization is employed, the width of the intermediate layers plays no role in gradient-based training, as long as the width is greater than or equal to both the input and output dimensions. We show that under some conditions, the deeper the network is, the faster the guaranteed convergence rate. This implies that in an extreme case, the global optimum is achieved after updating each weight matrix only once. In addition, we find that the use of deep networks can drastically accelerate convergence compared to a depth-1 network, even when the computational cost is taken into account. Numerical examples are provided to justify our theoretical findings and to demonstrate the performance of layer-wise training by BCGD.
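
To make the layer-wise scheme concrete, the following is a minimal sketch of block coordinate gradient descent on a deep linear network, written in plain Python so that it has no dependencies. Several things here are assumptions rather than details taken from the abstract: the squared loss 0.5 * ||W_L ... W_1 X - Y||_F^2, the Gaussian initialization, the cyclic update order, and all function names (`bcgd_sweep`, `run_demo`, and the small matrix helpers). In particular, the step size is a conservative bound based on Frobenius norms, chosen only because it guarantees that each block update does not increase the loss; it is not the optimal learning rate derived in the paper.

```python
import random


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def sq_fro(A):
    """Squared Frobenius norm."""
    return sum(v * v for row in A for v in row)


def chain(mats, start):
    """Left-multiply `start` by each matrix in `mats`, first to last."""
    out = start
    for M in mats:
        out = matmul(M, out)
    return out


def loss(weights, X, Y):
    """Assumed objective: 0.5 * ||W_L ... W_1 X - Y||_F^2."""
    P = chain(weights, X)
    return 0.5 * sum((p - y) ** 2 for pr, yr in zip(P, Y) for p, y in zip(pr, yr))


def bcgd_sweep(weights, X, Y):
    """Update each layer once, in place, holding the other layers fixed."""
    for l, W in enumerate(weights):
        B = chain(weights[:l], X)                     # W_{l-1} ... W_1 X
        A = chain(weights[l + 1:], identity(len(W)))  # W_L ... W_{l+1}
        P = matmul(A, matmul(W, B))
        R = [[p - y for p, y in zip(pr, yr)] for pr, yr in zip(P, Y)]
        G = matmul(matmul(transpose(A), R), transpose(B))  # gradient w.r.t. W_l
        bound = sq_fro(A) * sq_fro(B)  # upper bound on the block Lipschitz constant
        if bound == 0.0:
            continue
        step = 1.0 / bound  # conservative safe step, NOT the paper's optimal rate
        weights[l] = [[w - step * g for w, g in zip(wr, gr)] for wr, gr in zip(W, G)]


def run_demo(depth=3, width=4, d_in=3, d_out=2, n=10, sweeps=20, seed=0):
    """Return the loss before training and after each sweep."""
    rng = random.Random(seed)

    def gauss(rows, cols):
        return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

    X = gauss(d_in, n)
    Y = matmul(gauss(d_out, d_in), X)  # targets from a random linear teacher
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    weights = [gauss(dims[i + 1], dims[i]) for i in range(depth)]
    losses = [loss(weights, X, Y)]
    for _ in range(sweeps):
        bcgd_sweep(weights, X, Y)
        losses.append(loss(weights, X, Y))
    return losses
```

Calling `run_demo()` returns the list of loss values, one per sweep, which should be non-increasing under the step size used here. With the other layers fixed, the loss is a convex quadratic in the block being updated, which is why a single safe step size per block can be computed from the fixed factors alone.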

Deep orthogonal linear networks are shallow

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks

Deep linear networks for regression are implicitly regularized towards flat minima

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

Orthogonal Weight Normalization: Solution to Optimization over Multiple Dependent Stiefel Manifolds in Deep Neural Networks

Over-parametrized neural networks as under-determined linear systems

On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions

Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Convergence Analysis for Learning Orthonormal Deep Linear Neural Networks

Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems

Can Shallow Neural Networks Beat the Curse of Dimensionality? A mean field training perspective

Effects of Depth, Width, and Initialization: A Convergence Analysis of Layer-wise Training for Deep Linear Neural Networks

Generalizing Orthogonalization for Models with Non-Linearities

Training Over-parameterized Deep ResNet is Almost As Easy As Training a Two-layer Network

Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank

Nonparametric regression using over-parameterized shallow ReLU neural networks

Topological obstruction to the training of shallow ReLU neural networks

Any Deep ReLU Network is Shallow

Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think