APC: Adaptive Patch Contrast for Weakly Supervised Semantic Segmentation

Wangyu Wu, Tianhong Dai, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
2024-07-15
Abstract: Weakly Supervised Semantic Segmentation (WSSS) using only image-level labels has gained significant attention due to its cost-effectiveness. The typical framework uses image-level labels as training data to generate pixel-level pseudo-labels, which are then refined. Recently, methods based on Vision Transformers (ViT) have demonstrated superior capabilities in generating reliable pseudo-labels, particularly in recognizing complete object regions, compared to CNN methods. However, current ViT-based approaches have limitations in how they use patch embeddings: they are prone to being dominated by a few abnormal patches, and many of them rely on multi-stage training that is time-consuming and inefficient. In this paper, we therefore introduce a novel ViT-based WSSS method named Adaptive Patch Contrast (APC) that significantly enhances patch embedding learning for improved segmentation effectiveness. APC uses an Adaptive-K Pooling (AKP) layer to address the limitations of previous max pooling selection methods. Additionally, we propose Patch Contrastive Learning (PCL) to enhance patch embeddings, further improving the final results. Furthermore, we improve upon the existing multi-stage training framework without CAM by transforming it into an end-to-end single-stage training approach, thereby enhancing training efficiency. Experimental results show that our approach is effective and efficient, outperforming other state-of-the-art WSSS methods on the PASCAL VOC 2012 and MS COCO 2014 datasets within a shorter training duration.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses key issues in weakly supervised semantic segmentation (WSSS), in particular how to improve segmentation performance when only image-level labels are available. The authors propose a new method called Adaptive Patch Contrast (APC), which is based on the Vision Transformer (ViT) and targets several limitations of ViT in WSSS. Specifically, the paper addresses the following problems:

1. **Limitations of patch embedding**: Existing ViT methods are prone to being dominated by outlier patches when selecting patches, which leads to inaccurate classification. APC addresses this by introducing an Adaptive-K Pooling (AKP) layer, which selects the value of K based on the differences in prediction scores, so that an error in an individual patch has less impact on overall performance.
2. **Inefficiency of multi-stage training**: Many ViT-based methods adopt a multi-stage training framework, which improves segmentation accuracy but significantly increases training time and complexity. APC proposes an end-to-end single-stage training method that eliminates the dependence on Class Activation Maps (CAM) and greatly improves training efficiency.
3. **Enhancement of contrastive learning**: To further improve the quality of patch embeddings, APC introduces Patch Contrastive Learning (PCL). PCL computes the cosine similarity between patch embeddings and uses it to increase compactness within the same category and separability between different categories, which improves the quality of the final pseudo-labels.

In summary, APC aims to improve the training efficiency and robustness of weakly supervised semantic segmentation while maintaining high segmentation accuracy in scenarios where only image-level labels are available. In experiments, APC outperforms other state-of-the-art WSSS methods on the PASCAL VOC 2012 and MS COCO 2014 datasets while requiring less training time.
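
To make the two patch-level ideas above more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's formulation: the function names `adaptive_k_pool` and `patch_contrast_loss`, the parameters `delta` and `k_max`, the rule "keep every patch whose score is within `delta` of the best score", and the exact form of the loss are all hypothetical choices made for this example. The summary only states that K is chosen from differences in prediction scores and that PCL relies on cosine similarity between patch embeddings.

```python
import numpy as np


def adaptive_k_pool(patch_scores, delta=0.1, k_max=6):
    """Pool per-patch class scores (N, C) into image-level scores (C,).

    Assumed rule (not from the paper): for each class, average the top-k
    patches, where k counts the patches whose score lies within `delta`
    of the best score, capped at `k_max`.
    """
    n_patches, n_classes = patch_scores.shape
    # Sort each class column in descending order.
    ranked = np.sort(patch_scores, axis=0)[::-1]
    # Gap between the best patch and every other patch, per class.
    gaps = ranked[0] - ranked
    # Because the columns are sorted, the "close" patches form a prefix.
    k = np.clip((gaps <= delta).sum(axis=0), 1, min(k_max, n_patches))
    pooled = np.array([ranked[: k[c], c].mean() for c in range(n_classes)])
    return pooled, k


def patch_contrast_loss(embeddings, labels, eps=1e-12):
    """Cosine-similarity contrast over patch embeddings (N, D).

    Assumed form (not from the paper): same-label pairs are pulled
    together via (1 - cos), different-label pairs are pushed apart by
    penalising positive cosine similarity.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / (norms + eps)
    sim = unit @ unit.T  # pairwise cosine similarity, shape (N, N)

    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos = same & not_self  # same class, excluding each patch with itself
    neg = ~same            # different classes

    pos_term = (1.0 - sim[pos]).mean() if pos.any() else 0.0
    neg_term = np.clip(sim[neg], 0.0, None).mean() if neg.any() else 0.0
    return float(pos_term + neg_term)
```

In this sketch, a class with several similarly confident patches is pooled over all of them, while a class with one clearly dominant patch falls back to max pooling (k = 1). The contrast term is near zero when same-label patches share a direction and different-label patches are orthogonal, and it grows when the labels disagree with the embedding geometry.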