Abstract: This paper explores the capability of ViT-based models under the generalized few-shot semantic segmentation (GFSS) framework. We conduct experiments with various combinations of backbone models, including ResNets and pretrained Vision Transformer (ViT)-based models, along with decoders featuring a linear classifier, UPerNet, and Mask Transformer. The combination of DINOv2 and a linear classifier takes the lead on the popular few-shot segmentation benchmark PASCAL-$5^i$, substantially outperforming the best ResNet-based structure by 116% in the one-shot scenario. We demonstrate the great potential of large pretrained ViT-based models on the GFSS task and expect further improvement on testing benchmarks. However, a potential caveat is that a pure ViT-based model combined with a large-scale ViT decoder overfits easily.
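
The best-performing configuration named in the abstract pairs a DINOv2 backbone with a linear classifier. The abstract does not give implementation details, so the following is only a minimal NumPy sketch of what a linear decoder over ViT patch tokens typically does: one linear layer maps each patch embedding to class logits, and the coarse patch grid is then upsampled to pixel resolution. The function name `linear_decode`, the toy shapes (a 4×4 patch grid, embedding size 384, 21 classes), the patch size of 14, and the use of nearest-neighbour rather than bilinear upsampling are all assumptions made for illustration, and random tokens stand in for real backbone features.

```python
import numpy as np


def linear_decode(patch_tokens, weight, bias, grid_hw, patch_size):
    """Map ViT patch tokens to a per-pixel class map with a single linear layer.

    patch_tokens: (N, D) array of patch embeddings, N = grid_h * grid_w.
    weight: (D, C) and bias: (C,) are the only parameters of the head.
    """
    grid_h, grid_w = grid_hw
    assert patch_tokens.shape[0] == grid_h * grid_w
    logits = patch_tokens @ weight + bias            # (N, C) class scores per patch
    logits = logits.reshape(grid_h, grid_w, -1)      # (grid_h, grid_w, C)
    # Nearest-neighbour upsampling: every patch covers patch_size x patch_size pixels.
    logits = logits.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    return logits.argmax(axis=-1)                    # (H, W) predicted class ids


# Hypothetical toy input: random vectors in place of real DINOv2 patch features.
rng = np.random.default_rng(0)
grid_hw, patch_size, dim, num_classes = (4, 4), 14, 384, 21
tokens = rng.normal(size=(grid_hw[0] * grid_hw[1], dim))
weight = rng.normal(size=(dim, num_classes)) * 0.01
bias = np.zeros(num_classes)

pred = linear_decode(tokens, weight, bias, grid_hw, patch_size)
print(pred.shape)  # (56, 56)
```

Because such a head has only `dim * num_classes + num_classes` parameters, it has far less capacity than a transformer decoder, which is one plausible reading of why the abstract flags overfitting for the larger ViT decoder rather than for the linear classifier.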

What problem does this paper attempt to address?

This paper explores the capabilities of Vision Transformer (ViT)-based models within the Generalized Few-shot Semantic Segmentation (GFSS) framework. The main research objectives are:

1. **Improving performance in GFSS tasks**: The paper combines different pretrained ViT models (such as DINO and DINOv2) with different decoders (linear classifier, UPerNet, Mask Transformer), measures how these combinations perform on GFSS tasks, and expects further improvements on the test benchmark.
2. **Addressing the overfitting problem**: Although ViT-based models perform well in some cases, overfitting is likely when a pure ViT architecture is paired with a large-scale ViT decoder. The research therefore also explores how to alleviate this problem.
3. **Evaluating the performance of different model architectures**: By comparing ResNet-based and ViT-based models on GFSS tasks, the paper evaluates the advantages and potential problems of ViT-based models relative to traditional CNN models.

### Specific problem description

- **Limitations of standard semantic segmentation models**: Traditional semantic segmentation models usually require a large amount of labeled data for training and can only predict the fixed set of classes defined in the training set. Collecting sufficient labeled data for new classes is very time-consuming, which limits the scalability of the model.
- **Limitations of FSS methods**: Existing Few-shot Segmentation (FSS) methods can adapt quickly to new classes with a small amount of labeled data, but they assume that every class in the query image is covered by the support images, which is unrealistic in practical applications. In addition, FSS models are evaluated only on the new classes, so any performance degradation on base classes and on other new classes goes unmeasured.
- **Requirements of GFSS tasks**: To be closer to practical application scenarios, GFSS tasks do not require the classes in the support image and the query image to be identical, and they evaluate performance on base classes and new classes simultaneously. This setting is more in line with real-world needs.

### Research contributions

- **Experimental results**: Experiments on PASCAL-$5^i$ show that ViT models based on DINO and DINOv2 significantly outperform ResNet-based models on GFSS tasks. In the one-shot scenario, the best DINOv2 configuration outperforms the best ResNet configuration by 116%.
- **Discovering the overfitting problem**: The research shows that, when connected to DINOv2, the Mask Transformer decoder is more likely to overfit than the linear classifier.

In summary, this paper aims to improve performance on GFSS tasks by introducing ViT-based models, and it explores the potential and challenges of these models in practical applications.
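
The GFSS requirement of scoring base classes and new classes at the same time can be made concrete with a small evaluation sketch. The text above does not include the paper's evaluation code, so this is an assumption-laden illustration: it computes per-class intersection-over-union (IoU) from a confusion matrix and averages it separately over base classes, novel classes, and both together. The helper names, the toy 4-class label maps, and the decision to leave the background class (id 0) out of all three averages are hypothetical choices, not details taken from the paper.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes):
    """Count pixels per (target, prediction) pair: rows are targets, columns are predictions."""
    idx = target.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def per_class_iou(conf):
    """IoU = TP / (predicted pixels + target pixels - TP); NaN for classes that never occur."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)


def gfss_miou(pred, target, base_ids, novel_ids, num_classes):
    """Mean IoU over base classes, novel classes, and the union of both."""
    iou = per_class_iou(confusion_matrix(pred, target, num_classes))
    return {
        "base": float(np.nanmean(iou[base_ids])),
        "novel": float(np.nanmean(iou[novel_ids])),
        "total": float(np.nanmean(iou[base_ids + novel_ids])),
    }


# Toy 4x4 label maps: class 0 = background, 1 and 2 = base classes, 3 = novel class.
target = np.array([[1, 1, 2, 2],
                   [1, 1, 2, 2],
                   [3, 3, 0, 0],
                   [3, 3, 0, 0]])
pred = np.array([[1, 1, 2, 2],
                 [1, 1, 2, 2],
                 [3, 0, 0, 0],
                 [3, 0, 0, 0]])

scores = gfss_miou(pred, target, base_ids=[1, 2], novel_ids=[3], num_classes=4)
print(scores)  # base 1.0, novel 0.5, total about 0.833
```

In this toy case the base classes are segmented perfectly while half of the novel class is missed, so an FSS-style evaluation restricted to the novel class and a GFSS-style evaluation over both groups give visibly different pictures of the same prediction.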

Applying ViT in Generalized Few-shot Semantic Segmentation
