Finding the Optimal Network Depth in Classification Tasks

Bartosz Wójcik, Maciej Wołczyk, Klaudia Bałazy, Jacek Tabor
DOI: https://doi.org/10.48550/arXiv.2004.08172
2020-04-17
Abstract: We develop a fast end-to-end method for training lightweight neural networks using multiple classifier heads. By allowing the model to determine the importance of each head and rewarding the choice of a single shallow classifier, we are able to detect and remove unneeded components of the network. This operation, which can be seen as finding the optimal depth of the model, significantly reduces the number of parameters and accelerates inference across different hardware processing units, which is not the case for many standard pruning methods. We show the performance of our method on multiple network architectures and datasets, analyze its optimization properties, and conduct ablation studies.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of finding the optimal depth of a neural network for classification tasks: reducing the number of model parameters and accelerating inference without sacrificing performance. Specifically, the authors propose NetCut, a fast end-to-end method for training lightweight neural networks that uses a multi-head mechanism to detect and remove unnecessary components of the network, achieving both compression and acceleration.

### Main contributions of the paper:

1. **Multi-head mechanism**: A classification head is added to each hidden layer, and the combined output of these heads is used as the final prediction. During training the model determines the importance of each head and is driven to select a single shallow classifier.
2. **Aggregation scheme**: A new probability aggregation method combines log-probabilities instead of probabilities, which encourages the model to select a single classification head. This scheme also avoids numerical instability and simplifies the model.
3. **Temporal regularization**: A regularization term that grows with the depth of the selected heads serves as a proxy for the time the network needs to process an input, pushing the model toward shallower solutions.
4. **Experimental verification**: Experiments on multiple network architectures and datasets show that NetCut performs stably under different settings, with significantly faster inference on both CPU and GPU and almost no loss in accuracy.

### Key technical details of the paper:

- **Multi-head**: A classification head is added to each hidden layer. The outputs of the heads are weighted by \( w_k \) and combined to form the final prediction \( \hat{o} \).
- **Log-probability aggregation**: The outputs of the classification heads are aggregated by exponentiating the weighted sum of their log-probabilities:

  \[ \hat{o}(i)=\exp\left(\sum_{k} w_k \ln \hat{o}_k(i)\right) \]

  This encourages the model to select a single classification head. The cross-entropy loss of the aggregate, \( -\sum_{k} w_k \ln \hat{o}_k(y) \) for true class \( y \), is linear in the weights, so for non-negative weights that sum to one it is minimized when some \( w_l = 1 \) and every other \( w_k = 0 \).
- **Temporal regularization**: A regularization term \( L_{\text{reg}}=\sum_{k} w_k k \), whose influence is controlled by the hyperparameter \( \beta \), serves as a proxy for the time the network needs to process an input.

### Experimental results:

- **Standard CNN**: On the CIFAR-10 dataset, NetCut compresses a 20-layer network into a shallower one while maintaining high accuracy.
- **ResNet**: On ResNet-110, adjusting the regularization coefficient \( \beta \) trades off the degree of compression against the loss in performance.
- **Fully-connected network**: On the MNIST and CIFAR-10 datasets, NetCut finds shallower networks, significantly reducing computational complexity while maintaining or improving test accuracy.
- **Graph-based network**: On randomly generated graph-based networks, NetCut also performs well and can find the optimal sub-graph under complex connection patterns.

### Conclusion:

NetCut provides an effective method for finding the optimal depth of a neural network. Through the multi-head mechanism and log-probability aggregation, it achieves model compression and acceleration while maintaining high performance. The method has shown good results across multiple network architectures and datasets.
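
To make the log-probability aggregation and temporal regularization described in the key technical details more concrete, here is a minimal Python sketch of the two formulas. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice to number heads from 1, the softmax parameterization of the weights, and the toy probabilities are all hypothetical, and the aggregate is left unnormalized because the formula in the summary has no normalizing term.

```python
import math


def softmax(scores):
    """Turn unconstrained scores into non-negative weights that sum to one.

    Assumption: the summary does not say how the weights w_k are parameterized.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_log_probs(head_probs, weights):
    """Compute o_hat(i) = exp(sum_k w_k * ln o_hat_k(i)) for every class i.

    head_probs[k][i] is the probability head k assigns to class i
    (assumed strictly positive so that the logarithm is defined).
    """
    num_classes = len(head_probs[0])
    return [
        math.exp(sum(w * math.log(p[i]) for w, p in zip(weights, head_probs)))
        for i in range(num_classes)
    ]


def regularized_loss(head_probs, weights, label, beta):
    """Cross-entropy of the aggregate plus beta * L_reg, with L_reg = sum_k w_k * k.

    Assumption: heads are numbered k = 1, 2, ... from shallowest to deepest.
    """
    aggregate = aggregate_log_probs(head_probs, weights)
    cross_entropy = -math.log(aggregate[label])
    depth_penalty = sum(w * k for k, w in enumerate(weights, start=1))
    return cross_entropy + beta * depth_penalty


# Toy example: three heads, three classes, true class 0 (made-up numbers).
head_probs = [
    [0.50, 0.30, 0.20],  # head 1 (shallowest)
    [0.70, 0.20, 0.10],  # head 2
    [0.90, 0.05, 0.05],  # head 3 (deepest)
]
uniform = softmax([0.0, 0.0, 0.0])
for name, w in [("uniform", uniform), ("only head 3", [0.0, 0.0, 1.0])]:
    print(name, regularized_loss(head_probs, w, label=0, beta=0.0))
```

With `beta=0.0` the loss is a weighted average of the per-head cross-entropies, so putting all weight on the most accurate head gives the lowest value. Raising `beta` adds a cost proportional to the head index, which is how the sketch mimics the preference for shallower heads.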