Abstract: In recent years, weakly supervised semantic segmentation with image-level labels as supervision has received significant attention in computer vision. Because these labels carry no spatial information, most existing methods generate pseudo-labels from class activation maps (CAMs) and use them to enable supervised learning. Due to the localized pattern detection of CNNs, CAMs often emphasize only the most discriminative parts of an object, making it difficult to separate foreground objects from each other and from the background. Recent studies have shown that Vision Transformer (ViT) features, due to their global view, capture the scene layout more effectively than CNN features. However, hierarchical ViTs have not been extensively explored in this field. This work explores the Swin Transformer by proposing "SWTformer", which improves the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 adds a multi-scale feature fusion mechanism to extract additional information and a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. In experiments on the PascalVOC 2012 dataset, SWTformer-V1 achieves 0.98% mAP higher localization accuracy than state-of-the-art models. Relying only on the classification network, it also yields comparable performance in generating initial localization maps, on average 0.82% mIoU higher than other methods. SWTformer-V2 further improves the accuracy of the generated seed CAMs by 5.32% mIoU, further demonstrating the effectiveness of the local-to-global view provided by the Swin Transformer. Code available at:
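To illustrate the general idea of deriving class probabilities and CAMs from patch tokens alone, the following is a minimal NumPy sketch, not the SWTformer implementation. It assumes a hypothetical per-token linear classifier (`weight`, `bias`) applied to a grid of final-stage patch tokens, with global average pooling producing the image-level logits. The function name, the toy shapes, and the random inputs are all illustrative assumptions.

```python
import numpy as np

def cams_from_patch_tokens(tokens, weight, bias):
    """Sketch of CAM generation from a patch-token grid (assumed setup).

    tokens: (H, W, D) patch tokens from the last backbone stage.
    weight: (C, D) and bias: (C,) of a hypothetical per-token linear classifier.
    Returns image-level logits (C,) and CAMs (C, H, W) scaled to [0, 1].
    """
    # Per-token class scores; equivalent to a 1x1 convolution over the grid.
    score_maps = np.einsum("hwd,cd->chw", tokens, weight) + bias[:, None, None]
    # Image-level logits via global average pooling of the score maps.
    logits = score_maps.mean(axis=(1, 2))
    # Keep positive evidence only, then scale each class map by its peak.
    cams = np.maximum(score_maps, 0.0)
    peak = cams.max(axis=(1, 2), keepdims=True)
    cams = cams / np.maximum(peak, 1e-8)
    return logits, cams

# Toy example: a 7x7 token grid, 16-dim features, 20 foreground classes.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((7, 7, 16))
weight = rng.standard_normal((20, 16))
bias = np.zeros(20)
logits, cams = cams_from_patch_tokens(tokens, weight, bias)
```

Because the classifier is linear, pooling the score maps gives the same logits as classifying the pooled tokens, which is why each score map can be read as a class-specific localization map. In practice the logits would be trained with a multi-label classification loss on the image-level labels, and only the CAMs of classes present in the image would be kept as seeds.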

Swin Transformer with Local Aggregation

SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection

Improved deep learning image classification algorithm based on Swin Transformer V2

Local-enhanced multi-scale aggregation swin transformer for semantic segmentation of high-resolution remote sensing images

SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection

Image Super-resolution Reconstruction Network based on Enhanced Swin Transformer via Alternating Aggregation of Local-Global Features

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

SwinHCST: a deep learning network architecture for scene classification of remote sensing images based on improved CNN and Transformer

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation

Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation

Semantic-Aware Local-Global Vision Transformer

Swin-TransUper: Swin Transformer-based UperNet for medical image segmentation

Swin Transformer coupling CNNs Makes Strong Contextual Encoders for VHR Image Road Extraction

Class-Guided Swin Transformer for Semantic Segmentation of Remote Sensing Imagery

SwinSOD: Salient object detection using swin-transformer

Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

G-SwinJSCC: Combining Transformer and GCN for Wireless Image Transmission