Abstract: End-to-end (E2E) training, optimizing the entire model through error backpropagation, fundamentally supports the advancements of deep learning. Despite its high performance, E2E training faces the problems of memory consumption, parallel computing, and discrepancy with the functionalities of the actual brain. Various alternative methods have been proposed to overcome these difficulties; however, none can yet match the performance of E2E training, and they therefore fall short in practicality. Furthermore, there is no deep understanding of the differences in trained model properties beyond the performance gap. In this paper, we reconsider why E2E training demonstrates superior performance through a comparison with layer-wise training, a non-E2E method that sets errors locally. On the basis of the observation that E2E training has an advantage in propagating input information, we analyze the information plane dynamics of intermediate representations based on the Hilbert-Schmidt independence criterion (HSIC). The results of our normalized HSIC value analysis reveal that E2E training exhibits different information dynamics across layers, in addition to efficient information propagation. Furthermore, we show that this layer-role differentiation leads to the final representation following the information bottleneck principle. This suggests the need to consider the cooperative interactions between layers, not just the final layer, when analyzing the information bottleneck of deep learning.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
This paper investigates why end-to-end (E2E) training outperforms layer-wise training. Although E2E training has shown excellent results in deep learning, it has drawbacks: high memory consumption, difficulty with parallel computation, and inconsistency with how the actual brain functions. Various alternative methods have been proposed to address these drawbacks, but none yet matches the performance of E2E training, which limits their practical use. In addition, beyond the performance gap, there is little in-depth understanding of how the properties of trained models differ across training methods.
### Main Research Content
1. **Performance Gap**:
- By comparing the performance of E2E training and layer-wise training, the authors found that layer-wise training tends to saturate in performance as network depth increases, whereas E2E training shows significant performance improvement with increased network depth.
- Layer-wise training tends to lose input information in the early layers, leading to insufficient representation capability in the final layers.
2. **Information Propagation**:
- The authors used the Hilbert-Schmidt independence criterion (HSIC) to analyze the information plane dynamics of intermediate representations, finding that E2E training exhibits different information dynamics in different layers and propagates input information more effectively.
- Through this layer-role differentiation, E2E training produces a final representation that follows the information bottleneck principle: representations are compressed while their HSIC values remain high.
3. **Information Bottleneck**:
- Comparing E2E training and layer-wise training through the lens of information bottleneck theory, the authors found that E2E training exhibits information bottleneck behavior in the final layer, whereas layer-wise training shows uniform behavior, either compression or increase, in every layer.
- This layer-role differentiation allows E2E training to better retain task-relevant information in intermediate layers, resulting in better representation in the final layer.
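To make the HSIC-based analysis above more concrete, the following is a minimal NumPy sketch of a normalized HSIC estimate between two sets of samples, for example a layer's activations and the inputs or labels. Everything in it is an assumption for illustration: the function names `gaussian_gram`, `hsic`, and `normalized_hsic` are hypothetical, the Gaussian kernel with a median-heuristic bandwidth is one common choice, and the normalization (dividing by the geometric mean of the two self-HSIC terms) may differ from the exact definition used in the paper.

```python
import numpy as np

def gaussian_gram(x, sigma=None):
    """Gaussian (RBF) Gram matrix over the rows of x."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_norms = np.sum(x * x, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    if sigma is None:
        # Assumption: median heuristic for the kernel bandwidth.
        positive = sq_dists[sq_dists > 0]
        sigma = np.sqrt(0.5 * np.median(positive)) if positive.size else 1.0
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(k, l):
    """Biased empirical HSIC computed from two n x n Gram matrices."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2

def normalized_hsic(x, y):
    """HSIC(x, y) scaled into [0, 1] by the self-HSIC terms (an assumed normalization)."""
    k, l = gaussian_gram(x), gaussian_gram(y)
    return hsic(k, l) / np.sqrt(hsic(k, k) * hsic(l, l))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(200, 5))
    hidden = inputs + 0.1 * rng.normal(size=(200, 5))  # stand-in for a layer output
    unrelated = rng.normal(size=(200, 5))
    print("dependent:  ", normalized_hsic(inputs, hidden))
    print("independent:", normalized_hsic(inputs, unrelated))
```

Tracking two such values per layer over training, one against the inputs and one against the labels, gives the kind of information plane trajectory the summary refers to.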
### Experimental Results
- **Linear Separability**:
- E2E training gradually improves linear separability across layers, while layer-wise training quickly saturates after an initial increase in early layers.
- Retraining experiments indicate that layer-wise training has already lost useful input information in the early layers, supporting the hypothesis that information collapses early under layer-wise training.
- **HSIC Dynamics**:
- In the LeNet5 model, E2E training and layer-wise training show similar dynamics on the HSIC plane for the MNIST dataset. For the CIFAR10 dataset, E2E training shows HSIC values increasing consistently across all layers, whereas under layer-wise training the first layer's HSIC value rises initially and then decreases.
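Linear separability of a layer's representation, as discussed in the results above, is typically measured by fitting a linear classifier on frozen features. The sketch below is one simple way to do this, assuming a ridge-regression probe on one-hot targets; the function name `linear_probe_accuracy`, the `ridge` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels, ridge=1e-3):
    """Fit a ridge-regression classifier on frozen features and return test accuracy."""
    train_labels = np.asarray(train_labels, dtype=int)
    test_labels = np.asarray(test_labels, dtype=int)
    n_classes = int(max(train_labels.max(), test_labels.max())) + 1

    def with_bias(feats):
        # Flatten each sample and append a constant column for the bias term.
        feats = np.asarray(feats, dtype=float).reshape(len(feats), -1)
        return np.hstack([feats, np.ones((len(feats), 1))])

    a = with_bias(train_feats)
    targets = np.eye(n_classes)[train_labels]  # one-hot targets
    # Closed-form ridge solution: (A^T A + ridge * I) W = A^T T
    w = np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ targets)
    preds = np.argmax(with_bias(test_feats) @ w, axis=1)
    return float(np.mean(preds == test_labels))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.arange(400) % 2
    # Toy "layer output": two well-separated clusters, one per class.
    feats = rng.normal(size=(400, 2)) + 4.0 * labels[:, None]
    acc = linear_probe_accuracy(feats[:300], labels[:300], feats[300:], labels[300:])
    print("probe accuracy:", acc)
```

Applying such a probe to each layer's output would show whether separability keeps improving with depth or saturates early. A representation that has discarded the input information, such as an all-zero feature matrix, leaves the probe at chance level.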
### Conclusion
This paper not only reveals the advantages of E2E training in terms of information propagation and the information bottleneck, but also emphasizes the need to consider cooperative interactions between layers, not just the final layer, when analyzing the information bottleneck in deep learning. These findings offer direction for future research on learning methods that do not rely on backpropagation and prompt a reevaluation of the advantages of E2E training.