SAM-RSIS: Progressively Adapting SAM With Box Prompting to Remote Sensing Image Instance Segmentation

Muying Luo, Tao Zhang, Shiqing Wei, Shunping Ji
DOI: https://doi.org/10.1109/tgrs.2024.3460085
IF: 8.2
2024-09-28
IEEE Transactions on Geoscience and Remote Sensing
Abstract: The recent segment anything model (SAM), trained on massive close-range images, has demonstrated impressive performance on general and task-specific segmentation with manual prompts. However, the significant domain shift between remote sensing and close-range images must be tackled before the pretrained SAM can be applied to remote sensing instance segmentation (RSIS). To address this and unlock the potential of SAM in RSIS, this article proposes a novel framework called SAM for remote sensing instance segmentation (SAM-RSIS), which overcomes a limitation of a few recent works that adapt only part of SAM to remote sensing. SAM-RSIS progressively fine-tunes the vision transformer (ViT) backbone and mask decoder of SAM on remote sensing data and uses automatic box prompting to eliminate the need for manual prompting. SAM-RSIS consists of an object detection stage and a mask generation stage. In object detection, we introduce an adapter to adapt the knowledge embedded in the pretrained ViT backbone to remote sensing images and then build an object detector. In mask generation, using the detected bounding boxes as prompts, along with two learnable mask output tokens and the two-layer high-resolution features from the adapter, we fine-tune the mask decoder of SAM to produce high-quality masks. Experimental results on the WHU, WHU-Mix, and NWPU datasets for binary and multiclass RSIS demonstrate the effectiveness and robustness of the proposed method, which surpasses various derivative methods of SAM and achieves performance comparable to or even better than specialized state-of-the-art instance segmentation methods.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
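
The two-stage data flow described in the abstract (detect boxes first, then use each box as a prompt for mask generation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_detector` and `toy_mask_decoder` are hypothetical stand-ins for the adapter-based detector and the fine-tuned SAM mask decoder, and they use simple thresholding purely so the example runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), half-open pixel ranges
Image = List[List[float]]         # toy single-channel image
Mask = List[List[bool]]


@dataclass
class Instance:
    box: Box
    score: float
    mask: Mask


def toy_detector(image: Image, threshold: float = 0.5) -> List[Tuple[Box, float]]:
    """Hypothetical stand-in for stage 1 (object detection).

    Returns a single box enclosing all pixels brighter than `threshold`.
    """
    coords = [(x, y) for y, row in enumerate(image)
              for x, v in enumerate(row) if v > threshold]
    if not coords:
        return []
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    return [((min(xs), min(ys), max(xs) + 1, max(ys) + 1), 1.0)]


def toy_mask_decoder(image: Image, box: Box, threshold: float = 0.5) -> Mask:
    """Hypothetical stand-in for stage 2 (box-prompted mask generation).

    Marks pixels that lie inside the prompt box and exceed `threshold`.
    """
    x0, y0, x1, y1 = box
    return [[(x0 <= x < x1 and y0 <= y < y1 and v > threshold)
             for x, v in enumerate(row)]
            for y, row in enumerate(image)]


def segment_instances(
    image: Image,
    detector: Callable[[Image], List[Tuple[Box, float]]],
    mask_decoder: Callable[[Image, Box], Mask],
) -> List[Instance]:
    """Run detection, then prompt the mask decoder once per detected box."""
    return [Instance(box, score, mask_decoder(image, box))
            for box, score in detector(image)]


image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
instances = segment_instances(image, toy_detector, toy_mask_decoder)
```

The point of the sketch is only the control flow: the detector's boxes replace manual prompts, so no user interaction is needed between the two stages.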
What problem does this paper attempt to address?