ConSept: Continual Semantic Segmentation via Adapter-based Vision Transformer

Bowen Dong,Guanglei Yang,Wangmeng Zuo,Lei Zhang
DOI: https://doi.org/10.48550/arXiv.2402.16674
2024-02-26
Abstract: In this paper, we delve into the realm of vision transformers for continual semantic segmentation, a problem that has not been sufficiently explored in previous literature. Empirical investigations on the adaptation of existing frameworks to vanilla ViT reveal that incorporating visual adapters into ViTs or fine-tuning ViTs with distillation terms is advantageous for enhancing the segmentation capability of novel classes. These findings motivate us to propose Continual semantic Segmentation via Adapter-based ViT, namely ConSept. Within the simplified architecture of ViT with linear segmentation head, ConSept integrates lightweight attention-based adapters into vanilla ViTs. Capitalizing on the feature adaptation abilities of these adapters, ConSept not only retains superior segmentation ability for old classes, but also attains promising segmentation quality for novel classes. To further harness the intrinsic anti-catastrophic forgetting ability of ConSept and concurrently enhance the segmentation capabilities for both old and new classes, we propose two key strategies: distillation with a deterministic old-classes boundary for improved anti-catastrophic forgetting, and dual dice losses to regularize segmentation maps, thereby improving overall segmentation performance. Extensive experiments show the effectiveness of ConSept on multiple continual semantic segmentation benchmarks under overlapped or disjoint settings. Code will be publicly available at https://github.com/DongSky/ConSept.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to explore and solve several key problems in continual semantic segmentation, especially in the application under the Vision Transformer (ViT) framework. Specifically:

1. **Limitations of traditional methods**:
   - Traditional Convolutional Neural Networks (CNNs) encounter two fundamental challenges in continual learning: the fixed-size convolution kernels limit the long-range interaction ability of the feature extractor, thus limiting the overall segmentation performance; and their ability to resist catastrophic forgetting of old classes is limited.
   - Existing ViT-based methods, although performing excellently, rely on complex decoders and additional region proposals, which limit their practicality.
2. **Potential and challenges of ViT in continual semantic segmentation**:
   - Although the pre-trained ViT performs well in static tasks, in the continual learning environment it tends to overfit the base classes, resulting in poor generalization ability for new classes.
   - The pre-trained ViT is prone to catastrophic forgetting when new classes are introduced, that is, the model forgets how to process old classes.
3. **The proposed new method ConSept**:
   - To overcome the above problems, the authors propose a new adapter-based Vision Transformer framework, ConSept (Continual Semantic Segmentation via Adapter-based ViT).
   - ConSept solves these problems in the following ways:
     - **Lightweight adapters**: Integrate lightweight attention adapters into ViT to enhance the generalization ability for new classes while maintaining high segmentation quality for old classes.
     - **Distillation strategy**: Adopt deterministic old-class boundaries for distillation to improve the ability to resist catastrophic forgetting.
     - **Dual-Dice loss**: Introduce dual-Dice loss to regularize the segmentation map, thereby improving the overall segmentation performance.
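The list above names a dual-Dice loss without giving its form. As a point of reference only, a standard soft Dice loss for a single class might be sketched as below. The function name `soft_dice_loss`, the smoothing term `eps`, and the flattened-list input format are assumptions made for illustration; how ConSept combines two such terms is not specified in this summary.

```python
def soft_dice_loss(probs, target, eps=1e-6):
    """Generic soft Dice loss for one class over flattened pixels.

    probs:  predicted foreground probabilities in [0, 1], one per pixel
    target: binary ground-truth labels (0 or 1), one per pixel
    eps:    smoothing term (an assumption) that avoids division by zero
    """
    if len(probs) != len(target):
        raise ValueError("probs and target must have the same length")
    # Overlap between the prediction and the ground truth.
    intersection = sum(p * t for p, t in zip(probs, target))
    # Total predicted mass plus total ground-truth mass.
    denom = sum(probs) + sum(target)
    # Dice coefficient is 2*|A∩B| / (|A| + |B|); the loss is its complement.
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


# Hypothetical usage: a perfect prediction gives a loss near 0,
# a completely wrong prediction gives a loss near 1.
perfect = soft_dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
wrong = soft_dice_loss([0.0, 1.0], [1, 0])
```

Because the Dice coefficient is computed from region overlap rather than per-pixel error, a loss of this kind acts on the segmentation map as a whole, which is consistent with the regularizing role described above.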
### Formula presentation

- **Cross-attention mechanism**:

  \[
  x_l^{\text{vit}} = x_l^{\text{vit}} + \text{Attn}\big(\text{norm}(x_l^{\text{vit}}),\ \text{norm}(x_l^{\text{ada}}),\ \text{norm}(x_l^{\text{ada}})\big)
  \]

  where \(\text{norm}(\cdot)\) represents LayerNorm, and \(\text{Attn}(q,k,v)\) represents the cross-attention mechanism.

- **Pseudo-label generation**:

  \[
  \hat{S}_{1:t-1} = \max_{C_{1:t-1}} \sigma(M_{t-1})
  \]

  where \(\sigma(\cdot)\) is the sigmoid function and \(M_{t-1}\) is the predicted segmentation mask.

- **Final pseudo-label merging**:

  \[
  \hat{S}_{1:t,i} =
  \begin{cases}
  S_{t,i} & \text{if } S_{t,i} \in C_t \\
  \hat{S}_{1:t-1,i} & \text{if } S_{t,i} \notin C_t
  \end{cases}
  \]

Through these improvements, ConSept not only improves the generalization ability for new classes but also significantly enhances the ability to resist catastrophic forgetting of old classes, thus achieving leading performance on multiple continual semantic segmentation benchmarks.
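The two pseudo-label formulas above can be illustrated with a small pure-Python sketch. It assumes that the maximum over old classes is used to pick the class index with the highest sigmoid score at each pixel, and that the previous model's mask logits are stored as a dictionary from class id to a flat list of per-pixel values. The function names and the data layout are hypothetical and are not taken from the ConSept code.

```python
import math


def sigmoid(x):
    """Logistic function, written as sigma in the formulas."""
    return 1.0 / (1.0 + math.exp(-x))


def old_class_pseudo_labels(mask_logits, old_classes):
    """Pick, per pixel, the old class with the highest sigmoid score.

    mask_logits: dict {class_id: [logit per pixel]} from the step t-1 model
    old_classes: list of class ids in C_{1:t-1}
    Assumption: the max in the formula selects the winning class index.
    """
    num_pixels = len(mask_logits[old_classes[0]])
    labels = []
    for i in range(num_pixels):
        best = max(old_classes, key=lambda c: sigmoid(mask_logits[c][i]))
        labels.append(best)
    return labels


def merge_labels(current_labels, pseudo_labels, new_classes):
    """Keep the step-t ground truth where it names a new class (C_t),
    otherwise fall back to the old-class pseudo label."""
    new_set = set(new_classes)
    return [s if s in new_set else p
            for s, p in zip(current_labels, pseudo_labels)]


# Hypothetical toy example: three old classes (0, 1, 2), one new class (3),
# and three pixels.
logits = {
    0: [2.0, -1.0, 0.5],
    1: [-1.0, 3.0, 0.1],
    2: [0.0, 0.0, 1.5],
}
pseudo = old_class_pseudo_labels(logits, [0, 1, 2])
# Pixel 1 is not annotated with a new class at step t, so it takes the
# pseudo label from the previous model.
merged = merge_labels([3, 0, 3], pseudo, [3])
```

In this toy case the middle pixel, which the step-t annotation does not assign to a new class, inherits the old class predicted by the previous model, while the two pixels labeled with the new class keep their ground truth.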