Abstract: In recent years, weakly supervised semantic segmentation using image-level labels as supervision has received significant attention in the field of computer vision. Because these labels carry no spatial information, most existing methods generate pseudo-labels from class activation maps (CAMs) and then train a segmentation model on them in a supervised manner. Due to the localized pattern detection of CNNs, CAMs often emphasize only the most discriminative parts of an object, making it difficult to accurately distinguish foreground objects from each other and from the background. Recent studies have shown that Vision Transformer (ViT) features, owing to their global view, capture the scene layout more effectively than CNN features. However, the use of hierarchical ViTs has not been extensively explored in this field. This work explores the use of the Swin Transformer by proposing "SWTformer", which enhances the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information and uses a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. In experiments on the PASCAL VOC 2012 dataset, SWTformer-V1 achieves 0.98% mAP higher localization accuracy than state-of-the-art models. In generating initial localization maps, it also performs comparably to other methods, scoring on average 0.82% mIoU higher while depending only on the classification network. SWTformer-V2 improves the accuracy of the generated seed CAMs by a further 5.32% mIoU, further demonstrating the effectiveness of the local-to-global view provided by the Swin Transformer. Code available at:
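
To make the idea of deriving class scores and CAMs from patch tokens alone more concrete, the following is a minimal sketch of one common formulation: a per-class linear projection of each patch token (equivalent to a 1x1 convolution), global average pooling for the class score, and ReLU plus min-max normalization for the map. It is an illustration under stated assumptions, not the SWTformer implementation. The function name `patch_tokens_to_cams`, the argument names, and the toy shapes are all hypothetical, and plain Python lists are used only to keep the example dependency-free; a real model would use a tensor library.

```python
# Hypothetical sketch: class scores and CAMs computed from patch tokens only.
# Names and shapes are illustrative assumptions, not the SWTformer code.

def patch_tokens_to_cams(tokens, weights, height, width):
    """tokens: list of height*width feature vectors (length D), row-major.
    weights: list of C class weight vectors (length D), acting like a 1x1 conv.
    Returns (scores, cams): scores has C entries, cams is C x height x width."""
    assert len(tokens) == height * width
    scores, cams = [], []
    for w in weights:
        # Per-patch class activation: dot product of token and class weights.
        act = [sum(t_d * w_d for t_d, w_d in zip(t, w)) for t in tokens]
        # Class score: global average pooling over all patch positions.
        scores.append(sum(act) / len(act))
        # CAM: keep positive evidence, then min-max normalise to [0, 1].
        pos = [max(a, 0.0) for a in act]
        lo, hi = min(pos), max(pos)
        span = hi - lo
        norm = [(p - lo) / span if span > 0 else 0.0 for p in pos]
        # Reshape the flat token sequence back to the 2D patch grid.
        cams.append([norm[r * width:(r + 1) * width] for r in range(height)])
    return scores, cams


# Toy example: a 2x2 patch grid, 2-dimensional features, 2 classes.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
scores, cams = patch_tokens_to_cams(tokens, weights, height=2, width=2)
```

In this toy case the first class responds most strongly at the bottom-left patch and the second class at the top-right patch, which is the kind of per-class localization a seed CAM is meant to provide before any refinement.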

Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation
