Abstract: It is shown that for deep neural networks, a single wide layer of width $N+1$ ($N$ being the number of training samples) suffices to prove the connectivity of sublevel sets of the training loss function. In the two-layer setting, the same property may not hold even if one has just one neuron less (i.e. width $N$ can lead to disconnected sublevel sets).
What problem does this paper attempt to address?
This paper studies the connectivity of sublevel sets of the training loss function in deep neural networks. Specifically, the author asks how wide a layer (i.e., how many neurons in that layer) must be for every sublevel set to be connected, and looks for the smallest width that guarantees this.
### Background of the Paper
In deep learning, the loss landscape of a neural network is crucial for understanding the optimization process. A sublevel set is the set of all parameters at which the loss function does not exceed a given threshold. If every sublevel set is connected, it means:
1. There are no "bad" local valleys on the loss surface, that is, no isolated region of suboptimal loss in which the optimization could get stuck.
2. All global minima are located within a unique global valley, which means that there is a continuous path from any global minimum to another global minimum.
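Point 2 can be written out as a path condition. The statement below uses the sublevel-set notation \(L_{\alpha}=\{\theta\mid\Phi(\theta)\leq\alpha\}\) for a training loss \(\Phi\), and reads "connected" as path-connected, which is an assumption made here for concreteness. For any two parameter vectors \(\theta_{0},\theta_{1}\in L_{\alpha}\), there is a continuous curve \(\gamma\) such that:
\[\gamma:[0,1]\to L_{\alpha},\qquad\gamma(0)=\theta_{0},\quad\gamma(1)=\theta_{1},\qquad\Phi(\gamma(t))\leq\alpha\ \text{ for all } t\in[0,1]\]
Taking \(\alpha\) equal to the global minimum value of \(\Phi\) (when it is attained) gives a path between any two global minima along which the loss never rises above that minimum.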
### Main Contributions
In this paper, the author sharpens earlier results by lowering the width condition that ensures connectivity of sublevel sets from \(2N\) to \(N + 1\), where \(N\) is the number of training samples. Specifically:
- **Deep Architecture**: If the width of the first layer is at least \(N+1\) and other assumptions hold, then each sublevel set is connected.
- **Two-layer Network**: The author also proves that in a two-layer network, if the width of the first layer is \(N\) (i.e., one neuron fewer than \(N + 1\)), then the sublevel sets may be disconnected. This shows that the \(N+1\) condition is tight: it cannot be weakened without additional assumptions on the data or the network.
### Mathematical Formulas
Let \(N\) be the number of training samples, \(\theta=(W_{l},b_{l})_{l = 1}^{L}\) be the network parameters, and \(\Phi(\theta)\) be the training loss function. The sublevel set is defined as:
\[L_{\alpha}=\{\theta\mid\Phi(\theta)\leq\alpha\}\]
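To make the definition concrete, here is a minimal sketch of testing membership in \(L_{\alpha}\) for a small two-layer network. The tanh activation, the mean squared error loss, and all function names are assumptions chosen for illustration; the summary does not specify the paper's activation or loss. The last helper samples the straight segment between two parameter vectors, which is only one candidate path.

```python
import math

def forward(theta, x):
    """Two-layer network (assumed form): x -> W2 . tanh(W1 x + b1) + b2."""
    W1, b1, W2, b2 = theta
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

def training_loss(theta, X, y):
    """Phi(theta): mean squared error over the N samples (assumed loss)."""
    return sum((forward(theta, x) - t) ** 2 for x, t in zip(X, y)) / len(X)

def in_sublevel_set(theta, X, y, alpha):
    """True if theta lies in L_alpha = {theta | Phi(theta) <= alpha}."""
    return training_loss(theta, X, y) <= alpha

def interpolate(theta0, theta1, t):
    """Entrywise (1 - t) * theta0 + t * theta1 on nested lists of floats."""
    if isinstance(theta0, (list, tuple)):
        return [interpolate(a, b, t) for a, b in zip(theta0, theta1)]
    return (1 - t) * theta0 + t * theta1

def segment_stays_in_sublevel_set(theta0, theta1, X, y, alpha, steps=50):
    """Check sampled points of the straight segment from theta0 to theta1."""
    return all(
        in_sublevel_set(interpolate(theta0, theta1, k / steps), X, y, alpha)
        for k in range(steps + 1)
    )
```

A segment that leaves \(L_{\alpha}\) does not show that the set is disconnected, since connectivity only asks for some continuous path, which may be curved. A segment that passes the sampled check is likewise only evidence at the sampled points.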
### Conclusion
Through this research, the author not only improves the existing theoretical results but also reveals the relationship between the width of the neural network and the connectivity of the sublevel sets of the loss function. This is of great significance for understanding the optimization problems in deep learning, especially providing a theoretical basis for designing more effective optimization algorithms and network structures.