Abstract: In this work, we seek to learn multiple mainstream vision tasks concurrently using a unified network, which is storage-efficient because numerous networks with task-shared parameters can be implanted into a single consolidated network. Our framework, vision transformer (ViT)-MVT, built on a plain and nonhierarchical ViT, incorporates numerous visual tasks into a modest supernet and optimizes them jointly across various dataset domains. For the design of ViT-MVT, we augment the ViT with a multihead self-attention (MHSE) to offer complementary cues in the channel and spatial dimensions, as well as a local perception unit (LPU) and locality feed-forward network (locality FFN) for information exchange in local regions, thus endowing ViT-MVT with the ability to effectively optimize multiple tasks. In addition, we construct a search space comprising potential architectures with a broad spectrum of model sizes to offer various optimal candidates for diverse tasks. We then design a layer-adaptive sharing technique that automatically determines whether each layer of the transformer block is shared across all tasks, enabling ViT-MVT to obtain task-shared parameters that reduce storage and task-specific parameters that learn task-related features, thereby boosting performance. Finally, we introduce a joint-task evolutionary search algorithm to discover an optimal backbone for all tasks under a total model size constraint, which challenges the conventional wisdom that visual tasks are typically supplied with backbone networks developed for image classification. Extensive experiments reveal that ViT-MVT outperforms state-of-the-art methods on multiple visual tasks while requiring considerably lower total storage cost. We further demonstrate that, once trained, ViT-MVT is capable of incremental learning when generalized to new tasks while retaining identical performance on the trained tasks.
The code is available at https://github.com/XT-1997/vitmvt.
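
To make the storage argument concrete, the following is a minimal sketch of how layer sharing interacts with a total model size constraint. It is a toy illustration under stated assumptions, not the authors' implementation: the function names `total_params` and `evolve_sharing_mask`, the per-layer parameter counts, and the fitness proxy (total task-specific parameters) are all hypothetical. The sketch also searches over sharing masks only, whereas the abstract describes learning the sharing decision and searching over backbone architectures.

```python
import random
from typing import Sequence, Tuple


def total_params(layer_params: Sequence[int],
                 shared: Sequence[bool],
                 num_tasks: int) -> int:
    """Storage cost when each layer is either shared by all tasks
    (stored once) or task-specific (stored once per task)."""
    if len(layer_params) != len(shared):
        raise ValueError("layer_params and shared must have equal length")
    return sum(p if s else p * num_tasks
               for p, s in zip(layer_params, shared))


def evolve_sharing_mask(layer_params: Sequence[int],
                        num_tasks: int,
                        budget: int,
                        generations: int = 20,
                        population: int = 16,
                        seed: int = 0) -> Tuple[bool, ...]:
    """Toy evolutionary search over sharing masks under a size budget.

    ASSUMPTION: the fitness below is a hypothetical proxy (more
    task-specific parameters is treated as better). A real system would
    score candidates by validation performance on every task.
    """
    rng = random.Random(seed)
    n = len(layer_params)

    def feasible(mask: Tuple[bool, ...]) -> bool:
        return total_params(layer_params, mask, num_tasks) <= budget

    def fitness(mask: Tuple[bool, ...]) -> int:
        return sum(p for p, s in zip(layer_params, mask) if not s)

    # The fully shared network is the cheapest candidate, so start there.
    start = tuple([True] * n)
    if not feasible(start):
        raise ValueError("budget is below the size of a fully shared network")

    pop = [start]
    for _ in range(generations):
        children = []
        for _ in range(population):
            parent = rng.choice(pop)
            i = rng.randrange(n)
            # Mutation: flip the shared/task-specific decision of one layer.
            child = parent[:i] + (not parent[i],) + parent[i + 1:]
            if feasible(child):
                children.append(child)
        # Keep the best feasible candidates for the next generation.
        pop = sorted(set(pop + children), key=fitness, reverse=True)[:population]
    return pop[0]
```

For example, with three layers of 10, 20, and 30 parameters and three tasks, sharing the first and last layers stores 10 + 3 × 20 + 30 = 100 parameters, compared with 180 when nothing is shared and 60 when everything is shared.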

Merging Vision Transformers from Different Tasks and Domains

ViT-MVT: A Unified Vision Transformer Network for Multiple Vision Tasks.

Multi-Scale And Token Mergence: Make Your ViT More Efficient

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization.

Efficient Vision Transformer via Token Merger

GhostViT: Expediting Vision Transformers Via Cheap Operations

ZipIt! Merging Models from Different Tasks without Training

Training-free Heterogeneous Model Merging

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

EMR-Merging: Tuning-Free High-Performance Model Merging

Training-Free Pretrained Model Merging

Training-Free Acceleration of ViTs with Delayed Spatial Merging

GiT: Towards Generalist Vision Transformer through Universal Language Interface

ViR: the Vision Reservoir

Brain-Inspired Stepwise Patch Merging for Vision Transformers

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

Tint Your Models Task-wise for Improved Multi-task Model Merging

Super Vision Transformer

FMViT: A multiple-frequency mixing Vision Transformer

TransMCGC: a Recast Vision Transformer for Small-Scale Image Classification Tasks.

Experts Weights Averaging: A New General Training Scheme for Vision Transformers