Frequency-Aware Multi-Modal Fine-Tuning for Few-Shot Open-Set Remote Sensing Scene Classification

Junjie Zhang, Yutao Rao, Xiaoshui Huang, Guanyi Li, Xin Zhou, Dan Zeng
DOI: https://doi.org/10.1109/tmm.2024.3372416
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Few-shot open-set recognition is a new paradigm that leverages a limited amount of supervised data to identify specific Remote Sensing (RS) scene categories and to generalize to novel ones. However, the data bias induced by the small sample size not only causes severe overfitting on the base classes but also impairs the model's ability to recognize RS scenes from previously unseen categories at inference time. Furthermore, owing to environmental influences, RS images frequently exhibit large intra-class variation and comparatively small inter-class differences, which makes it harder to obtain suitable classifiers. To address these issues, we investigate the use of a Multi-modal Foundational Model (MFM) infused with essential domain knowledge to mitigate the generalization limitations encountered in few-shot scenarios. Recognizing that existing MFMs with a visual-text dual-branch structure are primarily tailored to natural scenes, we propose a custom, parameter-efficient Frequency Distribution-based Multi-modal Fine-Tuning strategy (FreqDiMFT). More specifically, within the vision branch, we address the high inter-class similarity and intra-class diversity of RS images by embedding local-global frequency distribution information to facilitate the recognition of RS scenes. To further improve the model's generalization ability after transfer, we introduce an adaptive feature refinement module designed for Transformers that filters out redundant features resulting from domain disparities. To mitigate domain drift in the textual branch, we adopt an input format that combines basic templates with RS domain expertise to generate more discriminative class prototypes. To verify the effectiveness of FreqDiMFT in a more practical setting, we collect a Large-Scale hybrid dataset (LSRS). Extensive experiments demonstrate that, even with very few training samples, our strategy achieves superior performance compared to state-of-the-art models.
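The abstract does not say how the local-global frequency distribution is computed, so the following is only an illustrative sketch of the general idea, not the FreqDiMFT implementation. It assumes a grayscale image, summarizes the 2D FFT amplitude spectrum as mean log-amplitude over concentric frequency rings, and concatenates a whole-image descriptor with the average of per-patch descriptors. The function names (`radial_spectrum`, `local_global_frequency`), the patch size, and the number of rings are hypothetical choices made for this example.

```python
import numpy as np


def radial_spectrum(img, n_bins=8):
    """Mean log-amplitude of the centered 2D FFT over n_bins concentric rings."""
    h, w = img.shape
    # Shift the zero-frequency component to the center of the spectrum.
    amp = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(img))))
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    # Map each pixel's distance from the center to a ring index in [0, n_bins - 1].
    ring = np.minimum((radius / (radius.max() + 1e-8) * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(ring.ravel(), weights=amp.ravel(), minlength=n_bins)
    counts = np.bincount(ring.ravel(), minlength=n_bins)
    return sums / np.maximum(counts, 1)


def local_global_frequency(img, patch=16, n_bins=8):
    """Concatenate a global descriptor with the mean of non-overlapping patch descriptors."""
    h, w = img.shape
    global_desc = radial_spectrum(img, n_bins)
    local = [
        radial_spectrum(img[i:i + patch, j:j + patch], n_bins)
        for i in range(0, h - patch + 1, patch)
        for j in range(0, w - patch + 1, patch)
    ]
    if not local:  # image smaller than one patch: fall back to the global descriptor
        local = [global_desc]
    return np.concatenate([global_desc, np.mean(local, axis=0)])


rng = np.random.default_rng(0)
image = rng.random((64, 64))  # random stand-in for a grayscale RS image
features = local_global_frequency(image)
print(features.shape)  # (16,)
```

In a fine-tuning setup, a descriptor of this kind could be projected and injected into a vision encoder as an extra token or bias; how the paper actually embeds the frequency information is not stated in the abstract.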