Insight Any Instance: Promptable Instance Segmentation for Remote Sensing Images

Xuexue Li
2024-09-11
Abstract: Instance segmentation of remote sensing images (RSIs) is an essential task for a wide range of applications such as land planning and intelligent transport. Instance segmentation of RSIs is constantly plagued by the unbalanced ratio of foreground and background and limited instance size. And most of the instance segmentation models are based on deep feature learning and contain operations such as multiple downsampling, which is harmful to instance segmentation of RSIs, and thus the performance is still limited. Inspired by the recent superior performance of prompt learning in visual tasks, we propose a new prompt paradigm to address the above issues. Based on the existing instance segmentation model, firstly, a local prompt module is designed to mine local prompt information from original local tokens for specific instances; secondly, a global-to-local prompt module is designed to model the contextual information from the global tokens to the local tokens where the instances are located for specific instances. Finally, a proposal's area loss function is designed to add a decoupling dimension for proposals on the scale to better exploit the potential of the above two prompt modules. It is worth mentioning that our proposed approach can extend the instance segmentation model to a promptable instance segmentation model, i.e., to segment the instances with the specific boxes prompt. The time consumption for each promptable instance segmentation process is only 40 ms. The paper evaluates the effectiveness of our proposed approach based on several existing models in four instance segmentation datasets of RSIs, and thorough experiments prove that our proposed approach is effective for addressing the above issues and is a competitive model for instance segmentation of RSIs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key problems in instance segmentation of remote sensing images (RSIs):

1. **Foreground-background pixel ratio imbalance**: In remote sensing images, the proportion of foreground pixels is usually much lower than in natural scene images, which leaves the instance segmentation task with a heavily imbalanced foreground and background.
2. **Limited instance size**: Instances in remote sensing images are often small and cover few foreground pixels, which places higher demands on feature extraction.
3. **Limitations of existing models**: Most existing instance segmentation models are built on deep feature learning and include multiple downsampling operations. These operations hurt instance segmentation of remote sensing images because they discard spatial detail and exacerbate both the foreground-background imbalance and the limited instance size.

To address these problems, the paper proposes a new prompt paradigm with the following components:

### 1. Local Prompt Module (LPM)

This module mines rich texture information from the original image, in the local area where a specific instance is located, to compensate for the information lost to downsampling. The design of LPM is as follows:

- Define a local area as an area of the remote sensing image in which an instance exists.
- Obtain the spatial coordinates of these local areas from the input box prompts and candidate boxes.
- Enhance the representation of these local areas through frequency-domain interaction: apply a fast Fourier transform (FFT), add a learnable parameter embedding, apply the inverse fast Fourier transform (IFFT), and then pass the result through a multi-layer perceptron (MLP) to obtain the local prompt.

### 2. Global-to-local Prompt Module (GPM)

This module enhances the representation of small instances with global context information. The design of GPM is as follows:

- Divide the original image into global tokens, which form one part of the input.
- For the other part of the input, the tokens in the local areas where instances are located, upsample each local area to the same size and then divide it into local tokens.
- After adding position embeddings, pass the global tokens and the local tokens through the multi-head self-attention module of the Transformer structure. The resulting token sets are denoted \( G \) and \( L \), respectively.
- Apply the global-to-local attention \( \text{G2LAttn}(\cdot) \) to obtain the local tokens \( \tilde{L} \), which aggregate global context information:

\[ \tilde{L} = \text{G2LAttn}(G, L) = \text{softmax}\left(\frac{L G^T}{\sqrt{d_k}}\right) G \]

Here \( L \) supplies the queries and \( G \) supplies both the keys and the values, so each row of \( \tilde{L} \) is a weighted average of global tokens.

### 3. Proposals' Area Loss Function (PAreaLoss)

To improve the quality of candidate boxes, the paper introduces a new loss function, PAreaLoss, which optimizes how accurately candidate boxes match the ground truth in two-dimensional scale (area). The design is as follows:

- During training, PAreaLoss measures the deviation in area between the candidate boxes predicted by the model and the ground-truth boxes.
- Candidate boxes are assigned to ground-truth boxes using IoU as the criterion. After matching, the area deviation between each candidate box and its ground-truth box is computed and normalized, and the normalized deviations are averaged.

### Summary

The prompt paradigm proposed in this paper addresses the foreground-background imbalance and limited instance size in instance segmentation of remote sensing images by combining the local prompt module and the global-to-local prompt module. PAreaLoss additionally improves the quality of candidate boxes, which in turn improves overall instance segmentation performance.
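
The global-to-local attention formula in the GPM section can be sketched in a few lines. This is a minimal illustration of the formula as written in this summary, not the paper's implementation: it uses plain Python lists so it runs without dependencies, the function names `softmax` and `g2l_attention` are hypothetical, and the token values are made up for the example. Learned projections, multiple heads, and position embeddings are omitted.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def g2l_attention(G, L):
    """Compute softmax(L G^T / sqrt(d_k)) G.

    G: list of global tokens, each a list of d_k floats (keys and values).
    L: list of local tokens, each a list of d_k floats (queries).
    Returns one aggregated token per local token.
    """
    d_k = len(G[0])
    scale = math.sqrt(d_k)
    out = []
    for l_tok in L:
        # Scaled dot product of this local token with every global token.
        scores = [sum(a * b for a, b in zip(l_tok, g_tok)) / scale for g_tok in G]
        weights = softmax(scores)
        # Weighted average of the global tokens.
        out.append([sum(w * g_tok[j] for w, g_tok in zip(weights, G))
                    for j in range(d_k)])
    return out

# Hypothetical toy tokens: 3 global tokens and 2 local tokens with d_k = 2.
G = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
L = [[0.0, 0.0], [10.0, 0.0]]
L_tilde = g2l_attention(G, L)
print(L_tilde)
```

A zero query attends uniformly, so the first output is the mean of the global tokens. The second query aligns with the first coordinate, so its output is dominated by the global tokens whose first coordinate is large.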
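
The PAreaLoss steps described above (match by IoU, compute a normalized area deviation, then average) can be sketched as follows. Several details are assumptions, because this summary does not specify them: the IoU threshold of 0.5, matching each proposal to its highest-IoU ground-truth box, using the absolute area difference, and normalizing by the ground-truth area. The names `box_area`, `iou`, and `parea_loss` are hypothetical, and boxes are taken to be `(x1, y1, x2, y2)` tuples.

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou(a, b):
    # Intersection rectangle of the two boxes (empty if they do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def parea_loss(proposals, gt_boxes, iou_thresh=0.5):
    """Mean normalized area deviation over matched proposals.

    Assumptions (not confirmed by the source): a proposal is matched to its
    highest-IoU ground-truth box if that IoU reaches iou_thresh, and the
    deviation is |area_proposal - area_gt| / area_gt.
    """
    deviations = []
    for p in proposals:
        if not gt_boxes:
            break
        best = max(gt_boxes, key=lambda g: iou(p, g))
        if iou(p, best) < iou_thresh:
            continue  # unmatched proposals do not contribute
        gt_area = box_area(best)
        if gt_area > 0:
            deviations.append(abs(box_area(p) - gt_area) / gt_area)
    return sum(deviations) / len(deviations) if deviations else 0.0

# Hypothetical example: one ground-truth box and three proposals.
gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
proposals = [
    (0.0, 0.0, 10.0, 10.0),    # exact match, deviation 0
    (0.0, 0.0, 10.0, 8.0),     # IoU 0.8, area 80 vs 100, deviation 0.2
    (50.0, 50.0, 60.0, 60.0),  # no overlap, left unmatched
]
loss = parea_loss(proposals, gt_boxes)
print(loss)
```

In this sketch the loss is zero only when every matched proposal has exactly the area of its ground-truth box, which is the sense in which it adds a scale term on top of the usual box regression objective.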