Abstract: Few-shot open-set recognition is a new paradigm that leverages a limited amount of supervised data to identify specific Remote Sensing (RS) scene categories and generalize to novel ones. However, the data bias induced by the small sample size not only causes severe overfitting within base classes, but also impairs the model's ability to identify RS scenes from previously unseen categories. Furthermore, owing to environmental influences, RS images frequently exhibit large intra-class variation and comparatively small inter-class differences, which makes it harder to obtain suitable classifiers. To address these issues, we investigate the use of a Multi-modal Foundational Model (MFM) infused with essential domain knowledge to mitigate the generalization limitations encountered in few-shot scenarios. Recognizing that existing MFMs with a visual-text dual-branch structure are primarily tailored for natural scenes, we propose a custom, parameter-efficient Frequency Distribution-based Multi-modal Fine-Tuning strategy (FreqDiMFT). More specifically, within the vision branch, we address the high inter-class similarity and intra-class diversity in RS images by embedding local-global frequency distribution information to facilitate the recognition of RS scenes. To further improve the model's generalization ability after transfer, we introduce an adaptive feature refinement module designed for Transformers, which filters out redundant features resulting from domain disparities. To mitigate domain drift in the textual branch, we adopt an input format that combines basic templates with RS domain expertise to generate more discriminative class prototypes. To verify the effectiveness of FreqDiMFT in a more practical setting, we collect a Large-Scale hybrid dataset (LSRS). Extensive experiments demonstrate that, even with very few training samples, our strategy achieves improved performance compared to state-of-the-art models.
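
To make the few-shot open-set setting in the abstract concrete, the sketch below shows one common baseline formulation: build one prototype per known class from a handful of support features, assign each query to its nearest prototype by cosine similarity, and reject the query as "unknown" when the best similarity falls below a threshold. This is an illustrative assumption, not the FreqDiMFT method; the function names, the toy 3-D features, and the threshold of 0.5 are all hypothetical, and in practice the features would come from a fine-tuned vision or text encoder.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products equal cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def class_prototypes(features, labels, num_classes):
    # One prototype per known class: mean of normalized support features.
    feats = l2_normalize(features)
    protos = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    return l2_normalize(protos)

def open_set_predict(queries, protos, threshold):
    # Nearest prototype by cosine similarity; -1 marks a rejected (unknown) query.
    sims = l2_normalize(queries) @ protos.T
    best = sims.argmax(axis=1)
    return np.where(sims.max(axis=1) >= threshold, best, -1)

# Toy data (hypothetical): two known classes with two support samples each.
support = np.array([[1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.1, 0.9, 0.0]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[0.9, 0.1, 0.0],   # close to class 0
                    [0.1, 0.9, 0.0],   # close to class 1
                    [0.0, 0.0, 1.0]])  # orthogonal to both known classes

protos = class_prototypes(support, labels, num_classes=2)
preds = open_set_predict(queries, protos, threshold=0.5)
print(preds.tolist())  # [0, 1, -1]
```

The rejection threshold trades closed-set accuracy against unknown-class detection, so it would normally be tuned on held-out data rather than fixed by hand as it is here.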

A Multi-Modal Unified Representation Learning Framework with Masked Image Modeling for Remote Sensing Images

A Unified Multimodal Deep Learning Framework for Remote Sensing Imagery Classification

Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval

Transfer Representation Learning Meets Multimodal Fusion Classification for Remote Sensing Images

Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing

Learning Unified Sparse Representations For Multi-Modal Data

Elevation Information-Guided Multimodal Fusion Robust Framework for Remote Sensing Image Segmentation

Masked Feature Modeling for Generative Self-Supervised Representation Learning of High-Resolution Remote Sensing Images

UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

MMS-EF: A Multi-Scale Modular Extraction Framework for Enhancing Deep Learning Models in Remote Sensing

OpticalRS-4M: Scaling Efficient Masked Autoencoder Learning on Large Remote Sensing Dataset

Frequency-Aware Multi-Modal Fine-Tuning for Few-Shot Open-Set Remote Sensing Scene Classification

More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification

Multi-Resolution Multi-Modal Sensor Fusion For Remote Sensing Data With Label Uncertainty

SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification

Fus-MAE: A cross-attention-based data fusion approach for Masked Autoencoders in remote sensing

Dual Modality Collaborative Learning for Cross-Source Remote Sensing Retrieval

Transformer-based Multi-Modal Learning for Multi Label Remote Sensing Image Classification

Scaling Efficient Masked Image Modeling on Large Remote Sensing Dataset

MMT: Mixed-Mask Transformer for Remote Sensing Image Semantic Segmentation