Abstract: In this paper, we delve into the realm of vision transformers for continual semantic segmentation, a problem that has not been sufficiently explored in previous literature. Empirical investigations on the adaptation of existing frameworks to vanilla ViT reveal that incorporating visual adapters into ViTs or fine-tuning ViTs with distillation terms is advantageous for enhancing the segmentation capability of novel classes. These findings motivate us to propose Continual semantic Segmentation via Adapter-based ViT, namely ConSept. Within the simplified architecture of ViT with linear segmentation head, ConSept integrates lightweight attention-based adapters into vanilla ViTs. Capitalizing on the feature adaptation abilities of these adapters, ConSept not only retains superior segmentation ability for old classes, but also attains promising segmentation quality for novel classes. To further harness the intrinsic anti-catastrophic forgetting ability of ConSept and concurrently enhance the segmentation capabilities for both old and new classes, we propose two key strategies: distillation with a deterministic old-classes boundary for improved anti-catastrophic forgetting, and dual dice losses to regularize segmentation maps, thereby improving overall segmentation performance. Extensive experiments show the effectiveness of ConSept on multiple continual semantic segmentation benchmarks under overlapped or disjoint settings. Code will be publicly available at https://github.com/DongSky/ConSept.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to explore and solve several key problems in continual semantic segmentation, especially when it is applied under the Vision Transformer (ViT) framework. Specifically:

1. **Limitations of traditional methods**:
   - Traditional Convolutional Neural Networks (CNNs) face two fundamental challenges in continual learning: fixed-size convolution kernels limit the long-range interaction ability of the feature extractor, which in turn limits overall segmentation performance; and their ability to resist catastrophic forgetting of old classes is limited.
   - Existing ViT-based methods, although they perform excellently, rely on complex decoders and additional region proposals, which limits their practicality.
2. **Potential and challenges of ViT in continual semantic segmentation**:
   - Although the pre-trained ViT performs well in static tasks, in the continual learning setting it tends to overfit the base classes, resulting in poor generalization to new classes.
   - The pre-trained ViT is prone to catastrophic forgetting when new classes are introduced; that is, the model forgets how to segment old classes.
3. **The proposed new method ConSept**:
   - To overcome the above problems, the authors propose a new adapter-based Vision Transformer framework, ConSept (Continual Semantic Segmentation via Adapter-based ViT).
   - ConSept addresses these problems in the following ways:
     - **Lightweight adapters**: Integrate lightweight attention-based adapters into ViT to enhance generalization to new classes while maintaining high segmentation quality for old classes.
     - **Distillation strategy**: Apply distillation with a deterministic old-class boundary to improve resistance to catastrophic forgetting.
     - **Dual Dice loss**: Introduce dual Dice losses to regularize the segmentation maps, thereby improving overall segmentation performance.
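To make the Dice-based regularization mentioned above more concrete, here is a minimal sketch of a standard soft Dice loss on a single binary segmentation map. The summary does not say how the two Dice terms in ConSept are defined or weighted, so this only illustrates the generic building block, not the authors' loss. The names `soft_dice_loss` and `eps` are hypothetical, and the maps are flattened to plain Python lists for simplicity.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """Generic soft Dice loss for one binary map (illustrative, not ConSept's exact loss).

    pred:   per-pixel foreground probabilities in [0, 1]
    target: per-pixel binary ground truth (0 or 1)
    eps:    small constant (assumed value) to avoid division by zero
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice


# Toy 4-pixel example (values chosen for illustration only).
target = [1, 0, 1, 0]
perfect_loss = soft_dice_loss([1.0, 0.0, 1.0, 0.0], target)   # full overlap
disjoint_loss = soft_dice_loss([0.0, 1.0, 0.0, 1.0], target)  # no overlap
partial_loss = soft_dice_loss([0.8, 0.1, 0.6, 0.3], target)   # partial overlap
```

The loss is close to 0 when the predicted map overlaps the target completely and close to 1 when they do not overlap at all. Because it scores region overlap rather than per-pixel correctness, it can act as a regularizer on whole segmentation maps.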
### Formula presentation

- **Cross-attention mechanism**:
  \[ x_l^{\text{vit}} = x_l^{\text{vit}} + \text{Attn}(\text{norm}(x_l^{\text{vit}}), \text{norm}(x_l^{\text{ada}}), \text{norm}(x_l^{\text{ada}})) \]
  where \(\text{norm}(\cdot)\) represents LayerNorm, and \(\text{Attn}(q,k,v)\) represents the cross-attention mechanism.
- **Pseudo-label generation**:
  \[ \hat{S}_{1:t-1} = \max_{C_{1:t-1}} \sigma(M_{t-1}) \]
  where \(\sigma(\cdot)\) is the sigmoid function and \(M_{t-1}\) is the predicted segmentation mask.
- **Final pseudo-label merging**:
  \[ \hat{S}_{1:t,i} = \begin{cases} S_{t,i} & \text{if } S_{t,i} \in C_t \\ \hat{S}_{1:t-1,i} & \text{if } S_{t,i} \notin C_t \end{cases} \]

Through these improvements, ConSept not only improves the generalization ability for new classes but also significantly enhances the ability to resist catastrophic forgetting of old classes, thus achieving leading performance on multiple continual semantic segmentation benchmarks.
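The two pseudo-label formulas above can be sketched in a few lines of plain Python. This is a hedged illustration under stated assumptions rather than the authors' implementation: the maximum over old classes is read as choosing the highest-scoring old class per pixel (an argmax), pixels are flattened into a list, and the names `pseudo_labels`, `merge_labels`, `old_logits`, and `current_classes` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pseudo_labels(old_logits):
    """Pick, per pixel, the old class with the highest sigmoid score.

    old_logits[i][c] is the previous model's logit for old class c at pixel i
    (assumed layout). Returns one old-class index per pixel.
    """
    labels = []
    for pixel_logits in old_logits:
        probs = [sigmoid(v) for v in pixel_logits]
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return labels


def merge_labels(ground_truth, pseudo, current_classes):
    """Keep the ground truth where it belongs to the current step's classes C_t,
    otherwise fall back to the pseudo-label from the previous model."""
    return [gt if gt in current_classes else ps
            for gt, ps in zip(ground_truth, pseudo)]


# Toy example: old classes are 0, 1, 2; the current step adds class 3.
old_logits = [
    [2.0, -1.0, -3.0],
    [-1.0, 3.0, 0.5],
    [1.5, -2.0, -0.5],
    [-2.0, -1.0, 4.0],
]
ground_truth = [3, 0, 0, 3]  # only class 3 is annotated at this step
pseudo = pseudo_labels(old_logits)
merged = merge_labels(ground_truth, pseudo, current_classes={3})
```

In this toy case the second pixel is not annotated with a current-step class, so it takes old class 1 from the previous model's prediction. The first and last pixels keep their ground-truth label 3.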

ConSept: Continual Semantic Segmentation via Adapter-based Vision Transformer

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation.

TransVOS: Video Object Segmentation with Transformers

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers.

SegViT: Semantic Segmentation with Plain Vision Transformers

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

SAM-Adapter: Adapting Segment Anything in Underperformed Scenes

Enhancing surgical instrument segmentation: integrating vision transformer insights with adapter

Representation Separation for Semantic Segmentation with Vision Transformers

Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Delving into Transformer for Incremental Semantic Segmentation

Semantic Segmentation using Vision Transformers: A survey

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

Dual-Augmented Transformer Network for Weakly Supervised Semantic Segmentation

Transformer-Based Visual Segmentation: A Survey

Language-Aware Vision Transformer for Referring Segmentation

Exploring vision transformer layer choosing for semantic segmentation

Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers