Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models

Jielu Zhang, Zhongliang Zhou, Gengchen Mai, Mengxuan Hu, Zihan Guan, Sheng Li, Lan Mu
2024-08-25
Abstract: Remote sensing imagery has attracted significant attention in recent years due to its instrumental role in global environmental monitoring, land usage monitoring, and more. As image databases grow each year, performing automatic segmentation with deep learning models has gradually become the standard approach for processing the data. Despite the improved performance of current models, certain limitations remain unresolved. Firstly, training deep learning models for segmentation requires per-pixel annotations. Given the large size of datasets, only a small portion is fully annotated and ready for training. Additionally, the high intra-dataset variance in remote sensing data limits the transfer learning ability of such models. Although recently proposed generic segmentation models like SAM have shown promising results in zero-shot instance-level segmentation, adapting them to semantic segmentation is a non-trivial task. To tackle these challenges, we propose a novel method named Text2Seg for remote sensing semantic segmentation. Text2Seg overcomes the dependency on extensive annotations by employing an automatic prompt generation process using different visual foundation models (VFMs), which are trained to understand semantic information in various ways. This approach not only reduces the need for fully annotated datasets but also enhances the model's ability to generalize across diverse datasets. Evaluations on four widely adopted remote sensing datasets demonstrate that Text2Seg significantly improves zero-shot prediction performance compared to the vanilla SAM model, with relative improvements ranging from 31% to 225%. Our code is available at https://github.com/Douglas2Code/Text2Seg.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to address several key issues in semantic segmentation of remote sensing images:

1. **High Data Annotation Cost**: Training deep learning models for segmentation requires pixel-level annotations. Due to the large size of datasets, only a small amount of data is fully annotated, which limits the scale of model training.
2. **High Variability Within Datasets**: Remote sensing data exhibits significant differences in terms of sensors, geographic locations, time, etc., which limits the model's transfer learning capability.
3. **Challenges of Zero-Shot Segmentation**: Although recently proposed general segmentation models (such as SAM) perform well in zero-shot instance-level segmentation, applying them to semantic segmentation remains challenging.

To address these challenges, the authors propose a new method called Text2Seg for semantic segmentation of remote sensing images. Text2Seg reduces the reliance on large amounts of annotated data and improves the model's generalization ability across different datasets by automatically generating prompts using various Visual Foundation Models (VFMs). Experimental results show that Text2Seg significantly improves zero-shot prediction performance on four widely used remote sensing datasets, with relative improvements ranging from 31% to 225%.
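
To make the idea of automatic prompt generation more concrete, here is a minimal sketch of one way a text-guided model's output could be turned into prompts for a promptable segmenter such as SAM. It is not the authors' implementation. The function name `prompts_from_relevance`, the `relevance` grid (imagined as per-pixel scores for a text query such as "building"), and the `threshold` and `max_points` parameters are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]          # (x, y) pixel coordinate
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def prompts_from_relevance(
    relevance: List[List[float]],
    threshold: float = 0.5,
    max_points: int = 3,
) -> Tuple[List[Point], Optional[Box]]:
    """Derive point and box prompts from a text-conditioned relevance map.

    Hypothetical helper: `relevance[y][x]` is assumed to be a score in [0, 1]
    produced by some text-guided visual model for one class name.
    """
    # Keep every pixel whose score clears the threshold.
    hits = [
        (score, x, y)
        for y, row in enumerate(relevance)
        for x, score in enumerate(row)
        if score >= threshold
    ]
    if not hits:
        return [], None  # the class is judged absent from this image

    # Point prompts: the highest-scoring pixels.
    hits.sort(key=lambda h: h[0], reverse=True)
    points = [(x, y) for _, x, y in hits[:max_points]]

    # Box prompt: the tightest box around all above-threshold pixels.
    xs = [x for _, x, _ in hits]
    ys = [y for _, _, y in hits]
    box = (min(xs), min(ys), max(xs), max(ys))
    return points, box


# Toy 3x3 relevance map for a single class.
relevance = [
    [0.1, 0.2, 0.1],
    [0.1, 0.9, 0.7],
    [0.0, 0.6, 0.2],
]
points, box = prompts_from_relevance(relevance, threshold=0.5, max_points=2)
print(points, box)
```

In a pipeline of this kind, the returned points or box would be passed to the segmenter once per class name, and the resulting masks would be labeled with that class to form a semantic map. Because the class label comes from the text query, no per-pixel annotation is needed at this stage.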