Abstract: In this work, we seek to learn multiple mainstream vision tasks concurrently using a unified network, which is storage-efficient because numerous networks with task-shared parameters can be implanted into a single consolidated network. Our framework, vision transformer (ViT)-MVT, built on a plain, nonhierarchical ViT, incorporates numerous visual tasks into a modest supernet and optimizes them jointly across various dataset domains. For the design of ViT-MVT, we augment the ViT with a multihead self-attention (MHSE) to offer complementary cues in the channel and spatial dimensions, as well as a local perception unit (LPU) and a locality feed-forward network (locality FFN) for information exchange in the local region, thus endowing ViT-MVT with the ability to effectively optimize multiple tasks. In addition, we construct a search space comprising potential architectures with a broad spectrum of model sizes to offer various optimal candidates for diverse tasks. We then design a layer-adaptive sharing technique that automatically determines whether each layer of the transformer block is shared across all tasks, enabling ViT-MVT to obtain task-shared parameters that reduce storage and task-specific parameters that learn task-related features, thereby boosting performance. Finally, we introduce a joint-task evolutionary search algorithm to discover an optimal backbone for all tasks under a total model size constraint, which challenges the conventional wisdom that visual tasks are typically supplied with backbone networks developed for image classification. Extensive experiments reveal that ViT-MVT outperforms state-of-the-art methods on multiple visual tasks while requiring considerably lower total storage cost. We further demonstrate that, once trained, ViT-MVT is capable of incremental learning when generalized to new tasks while retaining identical performance on the trained tasks.
The code is available at https://github.com/XT-1997/vitmvt.
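
The storage argument behind layer sharing can be illustrated with a toy sketch. It is not the ViT-MVT implementation: the function names (`total_params`, `toy_score`, `evolve_sharing`), the per-layer parameter counts, and the fitness function are all hypothetical, and the search below mutates only a per-layer sharing mask rather than a full backbone architecture. It assumes a shared layer is stored once, a task-specific layer is stored once per task, and candidates that exceed the total size budget are discarded.

```python
import random


def total_params(layer_params, shared, num_tasks):
    """Storage cost: a shared layer counts once, a task-specific layer once per task."""
    return sum(p if s else p * num_tasks for p, s in zip(layer_params, shared))


def toy_score(shared):
    # Placeholder fitness (assumption): each task-specific layer is worth one point.
    # A real system would evaluate every task on validation data instead.
    return sum(1 for s in shared if not s)


def evolve_sharing(layer_params, num_tasks, budget,
                   generations=20, pop_size=16, seed=0):
    """Toy evolutionary search over sharing masks under a total size budget."""
    rng = random.Random(seed)
    n = len(layer_params)
    # Start from the all-shared mask, which is the cheapest candidate.
    population = [[True] * n]
    if total_params(layer_params, population[0], num_tasks) > budget:
        raise ValueError("budget is too small even when every layer is shared")
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            child = list(rng.choice(population))
            i = rng.randrange(n)
            child[i] = not child[i]  # mutation: flip one layer's sharing decision
            # Keep only candidates that satisfy the size constraint.
            if total_params(layer_params, child, num_tasks) <= budget:
                children.append(child)
        # Keep the highest-scoring feasible candidates for the next generation.
        population = sorted(population + children,
                            key=toy_score, reverse=True)[:pop_size]
    return population[0]


if __name__ == "__main__":
    layers = [10, 10, 10, 10]  # hypothetical per-layer parameter counts
    best = evolve_sharing(layers, num_tasks=3, budget=60)
    print(best, total_params(layers, best, num_tasks=3))
```

With four layers of 10 parameters each and three tasks, sharing every layer costs 40 parameters and each task-specific layer adds 20 more, so a budget of 60 leaves room for exactly one task-specific layer.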

Meta-attention for ViT-backed Continual Learning

Constituent Attention for Vision Transformers

LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

FasterViT: Fast Vision Transformers with Hierarchical Attention

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

Continual Learning via Learning a Continual Memory in Vision Transformer

Attention-Guided Contrastive Masked Image Modeling for Transformer-Based Self-Supervised Learning

ViT-MVT: A Unified Vision Transformer Network for Multiple Vision Tasks

Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization

Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning

ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections

Scratching Visual Transformer's Back with Uniform Attention

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

Masked Image Residual Learning for Scaling Deeper Vision Transformers

An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning

DeepViT: Towards Deeper Vision Transformer

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

Mask-Guided Vision Transformer for Few-Shot Learning

VSA: Learning Varied-Size Window Attention in Vision Transformers

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition