Abstract: End-to-end (E2E) training, which optimizes the entire model through error backpropagation, fundamentally supports the advancements of deep learning. Despite its high performance, E2E training faces problems of memory consumption, limited parallel computing, and discrepancy with the functionalities of the actual brain. Various alternative methods have been proposed to overcome these difficulties; however, none can yet match the performance of E2E training, and they therefore fall short in practicality. Furthermore, there is no deep understanding of how the properties of the trained models differ beyond the performance gap. In this paper, we reconsider why E2E training demonstrates superior performance through a comparison with layer-wise training, a non-E2E method that sets errors locally. On the basis of the observation that E2E training has an advantage in propagating input information, we analyze the information plane dynamics of intermediate representations based on the Hilbert-Schmidt independence criterion (HSIC). The results of our normalized HSIC value analysis reveal the ability of E2E training to exhibit different information dynamics across layers, in addition to efficient information propagation. Furthermore, we show that this layer-role differentiation leads to the final representation following the information bottleneck principle. This suggests the need to consider the cooperative interactions between layers, not just the final layer, when analyzing the information bottleneck of deep learning.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper explores why end-to-end (E2E) training outperforms layer-wise training. Although E2E training has shown excellent results in deep learning, it faces issues such as memory consumption, limited parallel computation, and inconsistency with actual brain functions. Various alternative methods have been proposed to overcome these issues, but their performance still cannot match that of E2E training, which limits their practical application. Additionally, there is currently a lack of in-depth understanding of how model properties differ under different training methods.

### Main Research Content

1. **Performance Gap**:
   - By comparing E2E training and layer-wise training, the authors found that the performance of layer-wise training tends to saturate as network depth increases, whereas E2E training improves significantly with increased depth.
   - Layer-wise training tends to lose input information in the early layers, leading to insufficient representation capability in the final layers.
2. **Information Propagation**:
   - The authors used the Hilbert-Schmidt Independence Criterion (HSIC) to analyze the information plane dynamics of intermediate representations, finding that E2E training exhibits different information dynamics between layers, allowing more effective propagation of input information.
   - E2E training follows the information bottleneck principle through layer-role differentiation, compressing intermediate representations while maintaining high HSIC values.
3. **Information Bottleneck**:
   - By analyzing the differences between E2E training and layer-wise training through information bottleneck theory, the authors found that E2E training exhibits information bottleneck behavior in the final layer, whereas layer-wise training shows uniform compression or increase in each layer.
   - This layer-role differentiation allows E2E training to better retain task-relevant information in intermediate layers, resulting in a better representation in the final layer.

### Experimental Results

- **Linear Separability**:
  - E2E training gradually improves linear separability across layers, while layer-wise training quickly saturates after an initial increase in the early layers.
  - Retraining experiments indicate that layer-wise training has already lost useful input information in the early layers, further supporting the information collapse hypothesis.
- **HSIC Dynamics**:
  - In the LeNet5 model, E2E training and layer-wise training behave similarly on the HSIC plane for the MNIST dataset. For the CIFAR10 dataset, E2E training shows consistent HSIC value increases across all layers, whereas under layer-wise training the first layer's HSIC value initially increases and then decreases.

### Conclusion

This paper reveals the advantages of E2E training in terms of information propagation and the information bottleneck, and it emphasizes the need to consider the synergy between layers, not just the final layer, when analyzing the information bottleneck in deep learning. These findings provide direction for future research on learning methods without backpropagation and prompt a reevaluation of the advantages of E2E training.
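The HSIC-based information plane described in the summary can be illustrated with a small NumPy sketch. This is not the authors' implementation: the Gaussian kernel, the median-distance bandwidth, the biased HSIC estimator, and the normalization by `sqrt(HSIC(A, A) * HSIC(B, B))` are all assumptions made here for illustration, and the paper's exact normalized HSIC may be defined differently. The function names (`gaussian_kernel`, `hsic`, `normalized_hsic`, `information_plane_point`) are hypothetical.

```python
import numpy as np


def gaussian_kernel(x):
    """Gaussian kernel matrix over the rows of x.

    Bandwidth comes from the median pairwise squared distance
    (an assumption of this sketch, not taken from the paper).
    """
    x = np.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)  # flatten each sample to a vector
    sq = np.sum(x * x, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    np.fill_diagonal(d2, 0.0)
    off_diag = d2[np.triu_indices(x.shape[0], k=1)]
    med = np.median(off_diag)
    scale = med if med > 0 else 1.0  # fallback when all samples coincide
    return np.exp(-d2 / scale)


def hsic(k, l):
    """Biased empirical HSIC from two kernel matrices: tr(K H L H) / (n - 1)^2."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2


def normalized_hsic(a, b):
    """HSIC(a, b) scaled into [0, 1] (assumed normalization)."""
    k, l = gaussian_kernel(a), gaussian_kernel(b)
    denom = np.sqrt(hsic(k, k) * hsic(l, l))
    return float(hsic(k, l) / denom) if denom > 0 else 0.0


def information_plane_point(x, y_onehot, z):
    """Coordinates of representation z on the HSIC plane.

    First coordinate: dependence on the input x.
    Second coordinate: dependence on the labels y.
    """
    return normalized_hsic(x, z), normalized_hsic(y_onehot, z)
```

Calling `information_plane_point` on each layer's activations for the same mini-batch, once per training epoch, would trace one trajectory per layer. Under these assumptions, a layer that compresses its input while staying predictive of the labels would move toward a lower first coordinate and a higher second coordinate.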
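The linear separability measurements mentioned under Experimental Results can be approximated with a linear probe on frozen intermediate features. The sketch below is a stand-in under stated assumptions: it fits a ridge-regularized least-squares classifier in closed form, whereas the paper may train its probe differently. The function name `linear_probe_accuracy` and the `ridge` parameter are hypothetical.

```python
import numpy as np


def linear_probe_accuracy(train_z, train_y, test_z, test_y, n_classes, ridge=1e-3):
    """Accuracy of a linear classifier fitted on frozen features.

    train_z / test_z: activations of one layer, shape (n_samples, ...).
    train_y / test_y: integer class labels in [0, n_classes).
    The probe is ridge-regularized least squares onto one-hot targets
    (an assumption of this sketch).
    """
    def design(z):
        z = np.asarray(z, dtype=float)
        z = z.reshape(z.shape[0], -1)
        return np.hstack([z, np.ones((z.shape[0], 1))])  # append bias column

    a = design(train_z)
    targets = np.eye(n_classes)[np.asarray(train_y)]
    # Closed-form ridge solution: (A^T A + ridge * I) W = A^T T
    w = np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ targets)
    pred = np.argmax(design(test_z) @ w, axis=1)
    return float(np.mean(pred == np.asarray(test_y)))
```

Running this probe on every layer's output gives one accuracy per depth. If early layers have discarded input information, as the summary describes for layer-wise training, the probe accuracy of later layers cannot recover it and the curve flattens.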

End-to-End Training Induces Information Bottleneck through Layer-Role Differentiation: A Comparative Analysis with Layer-wise Training
