Background Noise Reduction of Attention Map for Weakly Supervised Semantic Segmentation

Izumi Fujimori, Masaki Oono, Masami Shishibori
2024-04-09
Abstract: In weakly-supervised semantic segmentation (WSSS) using only image-level class labels, a problem with CNN-based Class Activation Maps (CAM) is that they tend to activate the most discriminative local regions of objects. On the other hand, methods based on Transformers learn global features but suffer from the issue of background noise contamination. This paper focuses on addressing the issue of background noise in attention weights within the existing WSSS method based on Conformer, known as TransCAM. The proposed method successfully reduces background noise, leading to improved accuracy of pseudo labels. Experimental results demonstrate that our model achieves segmentation performance of 70.5% on the PASCAL VOC 2012 validation data, 71.1% on the test data, and 45.9% on MS COCO 2014 data, outperforming TransCAM in terms of segmentation performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses background noise in Weakly Supervised Semantic Segmentation (WSSS) trained with image-level class labels only. CNN-based Class Activation Maps (CAM) tend to activate only the most discriminative regions of an object, while Transformer-based methods learn global features but suffer from background noise contamination. The paper targets the background noise in the attention weights of TransCAM, an existing Conformer-based WSSS method, in order to improve the accuracy of the pseudo labels. Specifically, during training the proposed method feeds the attention-enhanced CAM (the CAM refined with the attention maps) into the loss function, which reduces the background noise that originates from attention. Experimental results show that the proposed method achieves a segmentation performance of 70.5% on the PASCAL VOC 2012 validation set, 71.1% on the test set, and 45.9% on MS COCO 2014, outperforming TransCAM. The main contributions of the paper are as follows:
1. A method that inputs the attention-enhanced CAM into the loss function during training, effectively reducing background noise.
2. Experiments on PASCAL VOC 2012 and MS COCO 2014 that show higher segmentation accuracy than TransCAM.
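The idea of feeding an attention-enhanced CAM into the training loss can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary does not state how the enhanced CAM is pooled or which loss is used, so the global average pooling, the multi-label soft-margin loss, and the function names `enhance_cam` and `enhanced_cam_loss` are all assumptions made for illustration. The attention matrix is assumed to be a single patch-to-patch matrix (for example, averaged over heads and layers).

```python
import numpy as np


def enhance_cam(cam, attn):
    """Refine a CAM with patch-to-patch attention (illustrative assumption).

    cam:  (C, H, W) class activation maps, one map per class.
    attn: (H*W, H*W) attention weights; row i says how much patch i
          attends to every other patch.
    """
    c, h, w = cam.shape
    flat = cam.reshape(c, h * w)   # (C, N)
    refined = flat @ attn.T        # refined[c, i] = sum_j attn[i, j] * flat[c, j]
    return refined.reshape(c, h, w)


def multilabel_soft_margin_loss(logits, labels):
    """Mean binary log-loss over classes, computed stably with logaddexp."""
    log_sig_pos = -np.logaddexp(0.0, -logits)  # log(sigmoid(x))
    log_sig_neg = -np.logaddexp(0.0, logits)   # log(1 - sigmoid(x))
    return float(-np.mean(labels * log_sig_pos + (1.0 - labels) * log_sig_neg))


def enhanced_cam_loss(cam, attn, labels):
    """Classification loss computed on the attention-enhanced CAM.

    Assumption: class scores are obtained by global average pooling of the
    enhanced CAM, so noise that attention spreads into the background
    affects the loss and can be penalized during training.
    """
    refined = enhance_cam(cam, attn)
    logits = refined.mean(axis=(1, 2))  # (C,) one score per class
    return multilabel_soft_margin_loss(logits, labels)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cam = rng.normal(size=(3, 4, 4))                 # 3 classes, 4x4 patches
    attn = rng.random(size=(16, 16))
    attn /= attn.sum(axis=1, keepdims=True)          # row-normalize
    labels = np.array([1.0, 0.0, 1.0])               # image-level labels
    print(enhanced_cam_loss(cam, attn, labels))
```

Because the loss is computed after the attention refinement in this sketch, its gradient would also flow through the attention weights in a real training framework, which is one plausible way such a loss could discourage attention on background patches.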