Simultaneously Training and Compressing Vision-and-Language Pre-Training Model

Qiaosong Qi, Aixi Zhang, Yue Liao, Wenyu Sun, Yongliang Wang, Xiaobo Li, Si Liu
DOI: https://doi.org/10.1109/tmm.2022.3233258
IF: 7.3
2022-01-01
IEEE Transactions on Multimedia
Abstract: Model compression is an essential step in bringing large-scale pre-training models to practical application and deployment on edge devices. However, when conventional compression methods that follow the ‘pre-training then compressing’ two-phase pipeline are applied to Vision-and-Language Pre-training (VLP) models, they incur high computation and memory overhead. In this work, we break the two-phase pipeline and propose an efficient and effective one-phase VLP model compression mechanism named REDUCER, which stands for ‘simultaneously training and compREssing’ a VLP model via progressive moDUle replaCing and nEtwork Rewiring. Specifically, REDUCER consists of three designs. First, we design a one-phase compression framework that trains and compresses the VLP model simultaneously, avoiding the extra computation and memory cost of the isolated compression phase in the conventional two-phase pipeline. Second, we propose an adaptive progressive module replacing mechanism that compresses the model depth without explicit knowledge distillation losses, relieving the multi-task optimization problem. Third, we integrate pruning techniques into VLP model compression so that the model is compressed in width and depth simultaneously. Overall, we obtain a lightweight VLP model with only one pre-training phase, and it is the first one-phase compression method for VLP models. Extensive experiments have been conducted on representative VLP models, i.e., ClipBERT and VICTOR, and the experimental results show a superior trade-off between performance and efficiency.
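The abstract does not give the details of the progressive module replacing mechanism, so the following is only a minimal sketch of the general idea of compressing depth by stochastic module replacement during a single training run. Everything in it is an assumption made for illustration: the linear schedule, the names `replace_probability` and `forward`, and the toy "layers" (plain Python functions standing in for Transformer blocks). It is not REDUCER's actual algorithm, and it omits the adaptive component and the pruning-based width compression.

```python
import random


def replace_probability(step, total_steps, p_start=0.0, p_end=1.0):
    """Hypothetical linear schedule: the chance of routing through the
    compact module grows from p_start to p_end as training proceeds."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac


def forward(x, original_groups, compact_modules, p, rng):
    """For each stage, use the compact module with probability p,
    otherwise run the full group of original layers it stands in for."""
    for group, compact in zip(original_groups, compact_modules):
        if rng.random() < p:
            x = compact(x)
        else:
            for layer in group:
                x = layer(x)
    return x


# Toy stand-ins: each original "layer" adds 1; each compact module
# replaces a group of two layers with a single step that adds 2.
def add_one(x):
    return x + 1


def add_two(x):
    return x + 2


original_groups = [[add_one, add_one], [add_one, add_one]]  # depth 4
compact_modules = [add_two, add_two]                        # depth 2

rng = random.Random(0)
total_steps = 100
for step in (0, 50, 100):
    p = replace_probability(step, total_steps)
    y = forward(0, original_groups, compact_modules, p, rng)
    print(f"step={step} p={p:.2f} output={y}")
```

In a real setting the compact modules would be trained under the task loss while they are mixed with the original layers, and at the end of training only the compact path would be kept. Because the mixed network is optimized with the ordinary training objective, no separate distillation loss is needed in this style of approach.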