Abstract: In view of the fact that semi- and self-supervised learning share a fundamental principle, effectively modeling knowledge from unlabeled data, various semi-supervised semantic segmentation methods have integrated representative self-supervised learning paradigms for further regularization. However, the potential of the state-of-the-art generative self-supervised paradigm, masked image modeling, has been scarcely studied. This paradigm learns knowledge by establishing connections between the masked and visible parts of a masked image during the pixel reconstruction process. By inheriting and extending this insight, we successfully leverage masked image modeling to boost semi-supervised semantic segmentation. Specifically, we introduce a novel class-wise masked image modeling that independently reconstructs different image regions according to their respective classes. In this way, the mask-induced connections are established within each class, mitigating the semantic confusion that arises from plainly reconstructing images in basic masked image modeling. To strengthen these intra-class connections, we further develop a feature aggregation strategy that minimizes the distances between features corresponding to the masked and visible parts within the same class. Additionally, in semantic space, we explore the application of masked image modeling to enhance regularization. Extensive experiments conducted on well-known benchmarks demonstrate that our approach achieves state-of-the-art performance. The code will be available at https://github.com/haoxt/S4MIM.
What problem does this paper attempt to address?
This paper addresses how to exploit the knowledge in unlabeled data more effectively in order to improve semi-supervised semantic segmentation. Specifically, it introduces a new method, Class-wise Masked Image Modeling (CW-MIM), that adapts masked image modeling to this setting.
### Background of the Paper
- **Semantic Segmentation**: Semantic segmentation is a fundamental task in computer vision, aiming to assign a class label to each pixel in an image.
- **Supervised Learning**: In recent years, supervised semantic segmentation based on deep neural networks has achieved significant success, but this method requires a large amount of pixel-level manually labeled data, which is both time-consuming and labor-intensive.
- **Semi-Supervised Learning**: To alleviate the demand for labeled data, semi-supervised semantic segmentation has been proposed, which leverages a small amount of labeled data and a large amount of unlabeled data for learning.
- **Self-Supervised Learning**: Self-supervised learning learns transferable representations by solving pretext tasks defined on unlabeled data. The resulting pre-trained models can then be used for downstream tasks such as classification, semantic segmentation, and object detection.
### Problem Statement
Masked Image Modeling (MIM) is the generative self-supervised paradigm that mirrors masked language modeling, which has been highly successful in natural language processing, and it has become the state of the art for self-supervised visual representation learning. Its application to semi-supervised semantic segmentation, however, has scarcely been studied. This paper therefore explores how to apply MIM to semi-supervised semantic segmentation to improve model performance.
### Solution
1. **Class-wise Masked Image Modeling (CW-MIM)**:
- **Basic Idea**: MIM learns knowledge by establishing connections between masked parts and visible parts. The paper introduces class-wise MIM, which reconstructs different image regions independently according to classes, thereby establishing mask-induced connections within each class and reducing semantic confusion caused by simply reconstructing the image.
- **Implementation**: Class information is injected into the pixel decoder so that intermediate features are grouped by class: the features of each class are kept only in that class's spatial region and are set to zero vectors everywhere else. Each group is reconstructed by its own independent head, and the per-class outputs are finally combined in pixel space to form the entire image.
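The grouping-and-recombination step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`group_by_class`, `linear_head`, `reconstruct`), the single-channel output, and the use of a linear map as each per-class head are all assumptions made for illustration, and plain Python lists stand in for tensors so the example has no dependencies.

```python
def group_by_class(features, class_map, num_classes):
    """Return one feature map per class.

    Positions outside a class's region hold zero vectors, so each group
    only carries information from its own class.
    """
    dim = len(features[0][0])
    return [
        [
            [list(vec) if cls == k else [0.0] * dim
             for vec, cls in zip(feat_row, cls_row)]
            for feat_row, cls_row in zip(features, class_map)
        ]
        for k in range(num_classes)
    ]


def linear_head(weights, bias):
    """Hypothetical per-class head: maps a feature vector to one pixel value."""
    def head(vec):
        return sum(w * x for w, x in zip(weights, vec)) + bias
    return head


def reconstruct(features, class_map, heads):
    """Reconstruct each class region with its own head, then merge the
    per-class outputs into a single image in pixel space."""
    grouped = group_by_class(features, class_map, len(heads))
    height, width = len(class_map), len(class_map[0])
    image = [[0.0] * width for _ in range(height)]
    for k, head in enumerate(heads):
        for i in range(height):
            for j in range(width):
                if class_map[i][j] == k:  # head k only writes inside region k
                    image[i][j] = head(grouped[k][i][j])
    return image


# Toy 2x2 feature map with 2-dimensional features and two classes.
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
class_map = [[0, 1], [1, 0]]
heads = [linear_head([1.0, 0.0], 0.0), linear_head([0.0, 1.0], 0.5)]
print(reconstruct(features, class_map, heads))  # [[1.0, 4.5], [6.5, 7.0]]
```

Because each head only ever sees and writes its own class region, a reconstruction error in one region cannot be explained away using features from another class, which is the intuition behind the reduced semantic confusion.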
2. **Class-wise Mask-induced Feature Aggregation**:
- **Basic Idea**: To strengthen intra-class connections, the paper proposes a strategy to explicitly minimize the distance between features of visible parts and masked parts within the same class.
- **Implementation**: A dictionary is maintained, where each entry corresponds to a class and stores the prototype of that class. The prototype is constructed from features of the visible parts belonging to that class. For each class, features of the masked parts are constrained to approach the prototype, thereby achieving intra-class feature aggregation.
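As a rough sketch of the prototype dictionary and the aggregation constraint, consider the example below. Several details are assumptions rather than facts from the paper: prototypes are taken as the plain mean of visible features, the distance is squared Euclidean, positions are flattened into a single list, and the names `build_prototypes` and `aggregation_loss` are hypothetical. How the dictionary is updated across training iterations is not covered here.

```python
def build_prototypes(features, labels, visible):
    """Map each class id to the mean feature of its visible positions."""
    sums, counts = {}, {}
    for vec, cls, is_visible in zip(features, labels, visible):
        if not is_visible:
            continue  # prototypes are built from visible parts only
        acc = sums.setdefault(cls, [0.0] * len(vec))
        for d, value in enumerate(vec):
            acc[d] += value
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: [s / counts[cls] for s in acc] for cls, acc in sums.items()}


def aggregation_loss(features, labels, visible, prototypes):
    """Mean squared distance between masked features and their class prototype."""
    total, count = 0.0, 0
    for vec, cls, is_visible in zip(features, labels, visible):
        if is_visible or cls not in prototypes:
            continue  # only masked positions with a known prototype contribute
        total += sum((x - p) ** 2 for x, p in zip(vec, prototypes[cls]))
        count += 1
    return total / count if count else 0.0


# Five positions, two classes; positions 2 and 4 are masked.
features = [[1.0, 1.0], [3.0, 3.0], [2.0, 4.0], [0.0, 0.0], [5.0, 5.0]]
labels = [0, 0, 0, 1, 1]
visible = [True, True, False, True, False]

prototypes = build_prototypes(features, labels, visible)
print(prototypes)  # {0: [2.0, 2.0], 1: [0.0, 0.0]}
print(aggregation_loss(features, labels, visible, prototypes))  # 27.0
```

Minimizing this loss pulls each masked feature toward the prototype of its own class, which is one way to realize the intra-class aggregation described above.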
3. **MIM in Semantic Space**:
- **Basic Idea**: To more comprehensively study the role of MIM, the paper also implements MIM in the semantic space, ensuring that the semantic predictions derived from the masked image are consistent with the original image.
- **Implementation**: In the output of the semantic decoder, masked data is supervised by pseudo-labels generated from the original data, thereby maintaining semantic consistency.
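The semantic-space consistency can be sketched as a cross-entropy between predictions on the masked image and pseudo-labels taken from the original image. This is a simplified illustration under stated assumptions: pseudo-labels are the per-position argmax of the original logits with no confidence filtering, the loss is averaged over all positions, and the function names are hypothetical. The paper may differ on each of these points.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's class logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_labels(original_logits):
    """Pick the most likely class per position from the unmasked image."""
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in original_logits]


def masked_consistency_loss(masked_logits, original_logits):
    """Cross-entropy of masked-image predictions against pseudo-labels
    generated from the original image."""
    targets = pseudo_labels(original_logits)
    losses = [-math.log(softmax(logits)[target])
              for logits, target in zip(masked_logits, targets)]
    return sum(losses) / len(losses)


# Two positions, two classes.
original = [[2.0, 0.5], [0.1, 3.0]]   # predictions on the original image
agreeing = [[4.0, 0.0], [0.0, 4.0]]   # masked-image predictions that agree
opposing = [[0.0, 4.0], [4.0, 0.0]]   # masked-image predictions that disagree

print(pseudo_labels(original))  # [0, 1]
print(masked_consistency_loss(agreeing, original)
      < masked_consistency_loss(opposing, original))  # True
```

In a real training loop the pseudo-labels would normally be treated as fixed targets, so that only the masked branch is pushed toward the original image's predictions.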
### Experimental Results
The paper conducts extensive experiments on well-known benchmark datasets such as PASCAL VOC 2012 and Cityscapes, showing that the proposed method achieves state-of-the-art performance.
### Conclusion
The main contributions of the paper include:
- Introducing class-wise MIM, which independently reconstructs image regions of different classes, reducing inter-class confusion.
- Developing a class-wise mask-induced feature aggregation strategy, explicitly minimizing the distance between features of visible parts and masked parts within the same class, enhancing intra-class connections.
- Exploring the application of MIM in the semantic space, ensuring that the semantic predictions of the masked image are consistent with the original image, further improving the model's generalization ability.
Through these innovations, the paper effectively utilizes the knowledge from unlabeled data, significantly improving the performance of semi-supervised semantic segmentation.