SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images

Kaiyu Li, Ruixun Liu, Xiangyong Cao, Xueru Bai, Feng Zhou, Deyu Meng, Zhi Wang
2024-11-04
Abstract: Remote sensing imagery plays an irreplaceable role in fields such as agriculture, water resources, military, and disaster relief. Pixel-level interpretation is a critical aspect of remote sensing image applications; however, a prevalent limitation remains the need for extensive manual annotation. To address this, we introduce open-vocabulary semantic segmentation (OVSS) into the remote sensing context. However, because remote sensing images are sensitive to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, to restore lost spatial information in deep features in a training-free style. Further, based on the observation of the abnormal response of local patch tokens to the [CLS] token in CLIP, we propose a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves average improvements of 5.8%, 8.2%, 4.0%, and 15.3% over state-of-the-art methods on these 4 tasks, respectively. All code is released at https://earth-insights.github.io/SegEarth-OV
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to achieve open-vocabulary semantic segmentation (OVSS) of remote-sensing images without training the segmentation model. Specifically, the authors focus on pixel-level interpretation of remote-sensing images without a large amount of manual annotation. Traditional methods usually need large manually annotated datasets to train models, which is a major obstacle in remote sensing because large-scale labels are very expensive to obtain. In addition, remote-sensing images are sensitive to low-resolution features, so prediction masks built from coarse deep features show distorted target shapes and ill-fitting boundaries, which limits the performance of existing methods on remote-sensing images.

To solve these problems, the authors propose a method named SegEarth-OV, which contains two main innovations:

1. **SimFeatUp**: A simple and general feature upsampler that recovers the spatial information lost in deep features. SimFeatUp is trained once on a small number of unlabeled images, with no annotations involved; after that it can upsample the features of any remote-sensing image while keeping them semantically consistent with the image content.
2. **Global Bias Mitigation**: The authors observe that in the CLIP model, local patch tokens respond abnormally to the global [CLS] token, which biases the predictions. They therefore propose a simple subtraction operation that reduces this bias by subtracting the global feature from the local patch features.

With these two innovations, SegEarth-OV is evaluated on 17 remote-sensing datasets covering semantic segmentation, building extraction, road detection, and flood detection. It improves on existing state-of-the-art methods by an average of 5.8%, 8.2%, 4.0%, and 15.3% on these four tasks, respectively, with the largest gain on flood detection, one of the single-class extraction tasks.
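
The subtraction step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `debias_patch_tokens` and `segment_patches`, the scaling factor `lam`, and its default value are assumptions made for illustration, and the inputs stand in for features that would normally come from a CLIP image and text encoder.

```python
import numpy as np


def debias_patch_tokens(cls_token, patch_tokens, lam=0.3):
    """Subtract a scaled copy of the global [CLS] feature from each patch feature.

    cls_token:    array of shape (D,), the global image feature.
    patch_tokens: array of shape (N, D), one feature per image patch.
    lam:          hypothetical scaling factor (an assumption of this sketch).
    """
    # Broadcasting applies the same global vector to all N patches.
    return patch_tokens - lam * cls_token[None, :]


def segment_patches(patch_tokens, text_embeds, grid_hw):
    """Label each patch with the class whose text embedding is most cosine-similar.

    text_embeds: array of shape (C, D), one embedding per class name.
    grid_hw:     (H, W) patch grid with H * W == N.
    """
    p = patch_tokens / np.linalg.norm(patch_tokens, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = p @ t.T  # (N, C) cosine similarities
    return logits.argmax(axis=-1).reshape(grid_hw)
```

On a toy input where every patch carries the same global offset, a patch whose local evidence favours one class but whose raw feature is dominated by the global feature changes label once the offset is subtracted; patches with strong local evidence are unaffected.
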
In conclusion, the main contribution of this paper is a training-free framework that achieves high-quality open-vocabulary semantic segmentation in remote-sensing images, thereby reducing the dependence on large-scale annotated data and improving segmentation accuracy.