Abstract: Weakly Supervised Semantic Segmentation (WSSS) using only image-level labels has gained significant attention due to its cost-effectiveness. The typical framework uses image-level labels as training data to generate pixel-level pseudo-labels, which are then refined. Recently, methods based on Vision Transformers (ViT) have demonstrated superior capabilities in generating reliable pseudo-labels, particularly in recognizing complete object regions, compared to CNN methods. However, current ViT-based approaches have two limitations: their use of patch embeddings is prone to being dominated by certain abnormal patches, and many multi-stage methods are time-consuming to train and therefore inefficient. In this paper, we introduce a novel ViT-based WSSS method named Adaptive Patch Contrast (APC) that significantly enhances patch embedding learning for improved segmentation effectiveness. APC utilizes an Adaptive-K Pooling (AKP) layer to address the limitations of previous max pooling selection methods. Additionally, we propose Patch Contrastive Learning (PCL) to enhance patch embeddings, thereby further improving the final results. Furthermore, we improve upon the existing multi-stage training framework without CAM by transforming it into an end-to-end single-stage training approach, thereby enhancing training efficiency. The experimental results show that our approach is effective and efficient, outperforming other state-of-the-art WSSS methods on the PASCAL VOC 2012 and MS COCO 2014 datasets within a shorter training duration.

What problem does this paper attempt to address?

The paper addresses key issues in weakly supervised semantic segmentation (WSSS), in particular how to improve segmentation performance when only image-level labels are available. The authors propose a new method called Adaptive Patch Contrast (APC), which is based on the Vision Transformer (ViT) and targets several limitations of ViT in WSSS. Specifically, the paper addresses the following problems:

1. **Limitations of patch embedding**: When selecting patches, existing ViT methods are prone to being dominated by outlier patches, resulting in inaccurate classification. APC addresses this by introducing an Adaptive-K Pooling (AKP) layer, which selects the K value based on the differences in prediction scores, so that errors in individual patches have less impact on overall performance.
2. **Inefficiency of multi-stage training**: Many ViT-based methods adopt a multi-stage training framework, which improves segmentation accuracy but significantly increases training time and complexity. APC proposes an end-to-end single-stage training method that does not depend on Class Activation Maps (CAM) and improves training efficiency.
3. **Enhancement of contrastive learning**: To further improve the quality of patch embeddings, APC introduces Patch Contrastive Learning (PCL), which computes the cosine similarity between patch embeddings to increase compactness within the same category and separability between different categories, thereby improving the quality of the final pseudo-labels.

In summary, APC aims to improve the training efficiency and robustness of weakly supervised semantic segmentation while maintaining high segmentation accuracy when only image-level labels are available. In experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets, APC outperforms other state-of-the-art WSSS methods while requiring less training time.
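
The AKP and PCL ideas described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary does not give the exact selection rule or loss, so the function names, the `margin`, `min_k` and `max_k` parameters, and the particular pull/push loss form are all assumptions chosen for illustration.

```python
import numpy as np


def adaptive_k_pool(scores, margin=0.1, min_k=2, max_k=8):
    """Pool per-patch scores for one class into an image-level score.

    Assumed rule (hypothetical): K is the number of patches whose score lies
    within `margin` of the top score, clamped to [min_k, max_k]. Averaging
    the top-K scores keeps a single abnormal patch from deciding the result,
    which is the weakness of plain max pooling.
    """
    ranked = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending
    k = int(np.sum(ranked >= ranked[0] - margin))
    k = min(max(k, min_k), max_k, ranked.size)
    return float(ranked[:k].mean()), k


def patch_contrast_loss(embeddings, labels):
    """Cosine-similarity contrast over patch embeddings (assumed form).

    Same-label pairs are pulled together via (1 - cos); different-label
    pairs are pushed apart by penalizing positive cosine similarity.
    `labels` stands in for per-patch pseudo-labels.
    """
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos = unit @ unit.T  # pairwise cosine similarity

    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)  # exclude self-pairs
    neg = ~same

    pull = (1.0 - cos[pos]).mean() if pos.any() else 0.0
    push = np.clip(cos[neg], 0.0, None).mean() if neg.any() else 0.0
    return float(pull + push)


if __name__ == "__main__":
    # One outlier patch (0.99) among low scores: max pooling would return 0.99.
    print(adaptive_k_pool([0.99, 0.10, 0.12, 0.08]))
    # Embeddings that cluster cleanly by label give a lower loss than mixed ones.
    emb = [[1, 0], [2, 0], [0, 1], [0, 3]]
    print(patch_contrast_loss(emb, [0, 0, 1, 1]))
    print(patch_contrast_loss(emb, [0, 1, 0, 1]))
```

In this sketch the `min_k` floor is what softens the effect of a lone high-scoring patch; a real layer would operate on batched tensors inside the training graph rather than on a single score vector.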

APC: Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation

Top-K Pooling with Patch Contrastive Learning for Weakly-Supervised Semantic Segmentation

Token Contrast for Weakly-Supervised Semantic Segmentation

PatchNet: Maximize the Exploration of Congeneric Semantics for Weakly Supervised Semantic Segmentation

Complementary Patch for Weakly Supervised Semantic Segmentation

PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation

Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation

Dual-Augmented Transformer Network for Weakly Supervised Semantic Segmentation

Weakly Supervised Semantic Segmentation with Patch-Based Metric Learning Enhancement

Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling

Weakly Supervised Semantic Segmentation by Pixel-to-Prototype Contrast

MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation

Weakly Supervised Semantic Segmentation via Progressive Patch Learning

SDPT: Semantic-Aware Dimension-Pooling Transformer for Image Segmentation

PCSformer: Pair-wise Cross-scale Sub-prototypes Mining with CNN-transformers for Weakly Supervised Semantic Segmentation

Contrastive Tokens and Label Activation for Remote Sensing Weakly Supervised Semantic Segmentation

WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

Learning Visual Words for Weakly-Supervised Semantic Segmentation

Image Augmentation Agent for Weakly Supervised Semantic Segmentation

Cross-Patch Relation Enhanced for Weakly Supervised Semantic Segmentation

A Weakly Supervised Semantic Segmentation Method Based on Local Superpixel Transformation