Abstract: Object detection in remote sensing imagery plays a vital role in various Earth observation applications. However, unlike object detection in natural scene images, this task is particularly challenging due to the abundance of small, often barely visible objects across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data to enhance detection performance. However, conventional MIM methods such as MAE, which use masked tokens without any contextual information, struggle to capture fine-grained details because the masked tokens do not interact with other parts of the image. To address this, we propose a new interactive MIM method that establishes interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluation demonstrate the effectiveness of our approach.

What problem does this paper attempt to address?

This paper addresses the challenges of multi-modal object detection in remote sensing images. Specifically, object detection in remote sensing images faces the following difficulties:

1. **Small and indistinguishable objects**: Remote sensing images often contain a large number of tiny, inconspicuous objects spread across different terrains, which makes detection harder.
2. **Data scarcity**: Compared with natural scene images, labeled data for remote sensing images is very limited, which restricts how well a model can be trained.
3. **Utilization of multi-modal data**: Although combining data from different modalities (such as RGB and infrared images) provides richer information, existing methods are limited in how well they process multi-modal data because the available multi-modal labeled datasets are small.

To solve these problems, the authors propose a new method based on self-supervised learning (SSL): Interactive Masked Image Modeling (IMIM). This method introduces a cross-attention mechanism to establish an interaction between masked tokens and unmasked tokens, so that the model better captures the fine-grained details in the image. This helps to improve the detection accuracy for small or densely arranged objects in remote sensing images.

### Specific problems and solutions

1. **Small object detection problem**:
   - **Limitations of traditional MIM**: Traditional masked image modeling methods (such as MAE) use masked tokens that carry no context information, so they cannot fully capture the fine-grained details in the image.
   - **Advantages of interactive MIM**: By introducing a cross-attention mechanism, IMIM establishes a dependency between masked tokens and unmasked tokens. This lets the encoder learn a more holistic understanding of the image content, which is especially useful for small object detection.
2. **Data scarcity problem**:
   - **Application of self-supervised learning**: Pre-training on a large amount of unlabeled data lets self-supervised learning improve the generalization ability of the model when labeled data is scarce. The IMIM method proposed in this paper achieves better performance on downstream tasks through self-supervised pre-training.
3. **Multi-modal data fusion problem**:
   - **Multi-modal feature fusion**: By combining RGB and infrared images, IMIM uses the information from different modalities to improve the accuracy of object detection. Experimental results show that using multi-modal data significantly improves the mAP (mean Average Precision) metric.

### Main contributions

1. The paper proposes a new Interactive Masked Image Modeling method (IMIM), which addresses the weakness of traditional MIM methods in capturing fine-grained details by introducing a cross-attention mechanism.
2. It uses self-supervised learning and multi-modal data fusion to alleviate the data scarcity problem in object detection for remote sensing images.
3. It verifies the effectiveness of the proposed method through extensive ablation experiments and evaluations in both single-modal and multi-modal settings.

In conclusion, this paper improves the performance of object detection in remote sensing images through the IMIM method, with the main gains in handling small objects and multi-modal data.
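The interaction between masked and unmasked tokens described above can be illustrated with a small sketch. It is not the authors' implementation: the summary does not specify the architecture, so the single attention head, the absence of learned projections, the 75% mask ratio, and all names (`random_mask`, `cross_attend`, `DIM`, and so on) are assumptions made for illustration. Raw random vectors stand in for the patch embeddings that an encoder would normally produce.

```python
import math
import random

DIM = 8            # embedding size (illustrative assumption)
NUM_PATCHES = 16   # e.g. a 4x4 grid of image patches (assumption)
MASK_RATIO = 0.75  # fraction of patches hidden (assumption)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible, masked) index lists."""
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    return sorted(indices[num_masked:]), sorted(indices[:num_masked])


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention.

    Each query attends over all context vectors; the context serves as both
    keys and values (no learned projections, to keep the sketch short).
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
# Stand-ins for patch embeddings and positional embeddings.
patch_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
pos_embed = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

visible_idx, masked_idx = random_mask(NUM_PATCHES, MASK_RATIO, rng)
visible_tokens = [patch_tokens[i] for i in visible_idx]

# Conventional MIM: every masked position receives the same placeholder vector
# plus its positional embedding, so it carries no image content.
mask_token = [0.0] * DIM
plain_mask_tokens = [
    [m + p for m, p in zip(mask_token, pos_embed[i])] for i in masked_idx
]

# Interactive variant (sketch): masked tokens act as queries over the visible
# tokens, so each masked position is filled with a content-dependent mixture.
interactive_tokens = cross_attend(plain_mask_tokens, visible_tokens)
```

In this sketch every entry of `interactive_tokens` is a weighted average of the visible tokens, which is the sense in which a masked position gains context; a trained model would additionally apply learned query, key, and value projections.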

Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing
