Filter Clustering for Compressing CNN Model with Better Feature Diversity

Zhenyu Wang, Xuemei Xie, Qinghang Zhao, Guangming Shi
DOI: https://doi.org/10.1109/tcsvt.2022.3216101
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: As a practical approach for compressing convolutional neural networks (CNNs), network pruning has developed rapidly in recent years. Conventional methods permanently prune inactive filters from a model to reduce the width of each layer and then train the pruned model until convergence. However, such methods have three limitations: (1) activation-based pruning criteria ignore the correlation between filters, which reduces the variety of feature types; (2) permanent filter removal restricts the model architecture in subsequent training, thereby reducing the chance of learning more features; (3) single-width compression may generate narrow layers that block the information flow, resulting in limited feature capacity in the following layers and difficult optimization. These limitations reduce the feature diversity of the pruned model and thus lead to sub-optimal model quality. In this paper, a compression method named filter clustering is proposed to address the poor feature diversity of traditional pruning and achieve better model quality from three perspectives. Firstly, to maintain the variety of features after pruning, we treat model compression as a clustering task and merge filters with similar outputs, rather than removing inactive filters. Specifically, a handy estimation approach is designed to convert output similarity into filter similarity, which frees the measurement from sampling numerous images. Secondly, to increase the probability of learning more features during training, we propose a periodic training and clustering pipeline, which creates a larger optimization space by dynamically exploring different sub-model architectures. Finally, to prevent the feature capacity from being limited by narrow layers, we introduce a fusible anti-blocking branch to smoothly remove such layers. Extensive experiments demonstrate that the proposed method can produce compact models with better feature diversity and remove 1% to 15% more computation than previous methods while maintaining performance.
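The idea of merging filters with similar outputs, rather than deleting inactive ones, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the similarity estimator, so the sketch assumes Euclidean distance between flattened filter weights as a data-free proxy (for a linear filter, close weights imply close outputs), uses a simple centroid-based agglomerative grouping, and ignores biases and batch normalization. The function names `cluster_filters` and `merge_layer_pair` are hypothetical.

```python
import numpy as np


def cluster_filters(weights, n_clusters):
    """Agglomeratively group filters until n_clusters groups remain.

    weights: array of shape (out_channels, in_channels, kh, kw).
    Distance between groups is the Euclidean distance between their mean
    flattened weights (an assumed proxy, not the paper's estimator).
    """
    flat = weights.reshape(weights.shape[0], -1)
    groups = [[i] for i in range(flat.shape[0])]
    while len(groups) > n_clusters:
        centroids = np.stack([flat[g].mean(axis=0) for g in groups])
        dist = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
        np.fill_diagonal(dist, np.inf)  # never merge a group with itself
        a, b = np.unravel_index(np.argmin(dist), dist.shape)
        a, b = int(min(a, b)), int(max(a, b))
        groups[a] = groups[a] + groups[b]  # merge the closest pair
        del groups[b]
    return groups


def merge_layer_pair(weights, next_weights, groups):
    """Replace each group by its mean filter and fold the next layer.

    next_weights: array of shape (next_out, out_channels, kh, kw).
    Summing the next layer's weights over the merged input channels is
    exact when the grouped filters produce identical outputs.
    """
    merged = np.stack([weights[g].mean(axis=0) for g in groups])
    merged_next = np.stack(
        [next_weights[:, g].sum(axis=1) for g in groups], axis=1
    )
    return merged, merged_next


# Toy layer: 4 filters, where the last is a near-duplicate of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 2, 3, 3))
w1 = np.concatenate([base, base[:1] + 1e-3 * rng.normal(size=(1, 2, 3, 3))])
w2 = rng.normal(size=(5, 4, 1, 1))

groups = cluster_filters(w1, n_clusters=3)
m1, m2 = merge_layer_pair(w1, w2, groups)
print(groups, m1.shape, m2.shape)
```

In this sketch the near-duplicate pair is merged into one filter while the two distinct filters survive, whereas an activation-based criterion could have discarded a distinct but weakly activated filter. The periodic training and clustering schedule and the fusible anti-blocking branch described in the abstract are not covered here.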