Merging Vision Transformers from Different Tasks and Domains

Peng Ye, Chenyu Huang, Mingzhu Shen, Tao Chen, Yongqi Huang, Yuning Zhang, Wanli Ouyang
DOI: https://doi.org/10.48550/arxiv.2312.16240
2023-01-01
Abstract: This work aims to merge multiple Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into one unified model that retains good performance on each task or domain. Previous model merging works focus on either CNNs or NLP models, leaving ViT merging unexplored. To fill this gap, we first find that existing model merging methods do not handle the merging of whole ViT models well and leave room for improvement. To enable the merging of the whole ViT, we propose a simple but effective gating network that can both merge all kinds of layers (e.g., Embedding, Norm, Attention, and MLP) and select the suitable classifier. Specifically, the gating network is trained on unlabeled datasets from all the tasks (domains), and predicts the probability that the input belongs to each task (domain); these probabilities are used to merge the models during inference. To further boost the performance of the merged model, especially as the difficulty of the merging tasks increases, we design a novel metric of model weight similarity and use it to realize controllable and combined weight merging. Comprehensive experiments on several newly established benchmarks validate the superiority of the proposed ViT merging framework for different tasks and domains. Our method can even merge more than 10 ViT models from different vision tasks with a negligible effect on the performance of each task.
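The gating idea in the abstract can be illustrated with a small sketch. The abstract does not state the exact merging rule, so this example assumes one plausible reading: the gate's task probabilities are used as coefficients in a convex combination of the per-task weights, and the classifier head is picked by the most probable task. The names `softmax`, `merge_weights`, and `select_classifier`, and the toy weights, are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Turn gate logits into task probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_weights(task_weights, probs):
    """ASSUMPTION: merge one layer as a probability-weighted average
    of the corresponding per-task weight vectors."""
    n = len(task_weights[0])
    return [
        sum(p * w[i] for p, w in zip(probs, task_weights))
        for i in range(n)
    ]


def select_classifier(probs):
    """ASSUMPTION: use the classifier head of the most probable task."""
    return max(range(len(probs)), key=lambda k: probs[k])


# Toy example: two task-specific models, each with a 2-element "layer".
task_weights = [[1.0, 0.0], [0.0, 1.0]]
gate_logits = [2.0, 0.0]  # hypothetical gate output for one input

probs = softmax(gate_logits)
merged = merge_weights(task_weights, probs)
head = select_classifier(probs)
```

In this toy setup the merged layer leans toward the first model because the gate assigns it the higher probability, and the first model's classifier head is selected. A real ViT would apply the same per-layer combination to tensors rather than short lists.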