Abstract: In recent years, remote sensing (RS) vision foundation models such as RingMo have emerged and achieved excellent performance on various downstream tasks. However, their high demand for computing resources limits their application on edge devices, so a more lightweight foundation model is needed to support on-orbit RS image interpretation. Existing methods struggle to be lightweight while retaining generalization in RS image interpretation. This is because RS images contain complex high-frequency (H-F) and low-frequency (L-F) spectral components, which make a traditional convolutional neural network (CNN) or vision Transformer used on its own unsuitable for the task. Therefore, this article proposes RingMo-lite, a lightweight RS network with a CNN-Transformer hybrid framework, which exploits the frequency-domain properties of RS images to optimize the interpretation process on tasks such as classification, object detection, semantic segmentation, and change detection. Through a dual-branch structure, it combines a Transformer module, which acts as a low-pass filter to extract global features of RS images, with a CNN module, which acts as a stacked high-pass filter to extract fine-grained details. Furthermore, a newly designed frequency-domain masked image modeling (FD-MIM) scheme is employed during the pretraining stage for self-supervised learning; it combines the H-F and L-F characteristics of each image patch and thereby captures the latent feature representation in RS data. Compared with RingMo, the proposed RingMo-lite reduces the parameters by over 60% in various RS image interpretation tasks, while the average accuracy drops by less than 2% in most scenes, and it achieves state-of-the-art (SOTA) performance compared to models of similar size. In addition, our work will be integrated into the MindSpore computing platform in the near future.
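
To make the H-F/L-F distinction in the abstract concrete, the following is a minimal sketch of splitting a single-channel image into low- and high-frequency parts with an ideal low-pass mask in the Fourier domain. It is an illustration only and is not the RingMo-lite architecture or the FD-MIM procedure; the function name `split_frequencies` and the `cutoff` parameter are hypothetical and chosen for this example.

```python
import numpy as np


def split_frequencies(image, cutoff=0.1):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `cutoff` (hypothetical parameter) is the radius of an ideal low-pass
    mask, in cycles per pixel (the Nyquist limit is 0.5).
    """
    h, w = image.shape
    # Frequency coordinates of every FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Keep only bins whose radial frequency is at or below the cutoff.
    low_mask = radius <= cutoff

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real  # smooth, global structure
    high = image - low                            # edges and fine detail
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64))
    low, high = split_frequencies(img, cutoff=0.1)
    # The two parts always sum back to the original image.
    print(np.allclose(low + high, img))
```

In this toy setting, the `low` output loosely corresponds to the kind of global content the abstract attributes to the Transformer branch, and `high` to the fine-grained detail attributed to the CNN branch.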

RingMo-Lite: A Remote Sensing Lightweight Network With CNN-Transformer Hybrid Framework

RingMo: A Remote Sensing Foundation Model with Masked Image Modeling

RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning

A lightweight and stochastic depth residual attention network for remote sensing scene classification

MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images

RingMo-SAM: A Foundation Model for Segment Anything in Multimodal Remote-Sensing Images

MFCANet: Multiscale Feature Context Aggregation Network for Oriented Object Detection in Remote-Sensing Images

Co-Training Transformer for Remote Sensing Image Classification, Segmentation, and Detection

CSCNN: Lightweight Modulation Recognition Model for Mobile Multimedia Intelligent Information Processing

CMR-net: A cross modality reconstruction network for multi-modality remote sensing classification

An Effective and Lightweight Hybrid Network for Object Detection in Remote Sensing Images

A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing

Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution

Remote sensing image instance segmentation network with transformer and multi-scale feature representation

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

CD-CTFM: A Lightweight CNN-Transformer Network for Remote Sensing Cloud Detection Fusing Multiscale Features

Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images

An Efficient and Lightweight Convolutional Neural Network for Remote Sensing Image Scene Classification

MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

Efficient Transformer for Remote Sensing Image Segmentation