One for All: Toward Unified Foundation Models for Earth Vision

Zhitong Xiong, Yi Wang, Fahong Zhang, Xiao Xiang Zhu

2024-05-28

Abstract: Foundation models characterized by extensive parameters and trained on large-scale datasets have demonstrated remarkable efficacy across various downstream tasks for remote sensing data. Current remote sensing foundation models typically specialize in a single modality or a specific spatial resolution range, limiting their versatility for downstream datasets. While there have been attempts to develop multi-modal remote sensing foundation models, they typically employ separate vision encoders for each modality or spatial resolution, necessitating a switch in backbones contingent upon the input data. To address this issue, we introduce a simple yet effective method, termed OFA-Net (One-For-All Network): employing a single, shared Transformer backbone for multiple data modalities with different spatial resolutions. Using the masked image modeling mechanism, we pre-train a single Transformer backbone on a curated multi-modal dataset with this simple design. Then the backbone model can be used in different downstream tasks, thus forging a path towards a unified foundation backbone model in Earth vision. The proposed method is evaluated on 12 distinct downstream tasks and demonstrates promising performance.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the limitations of current remote sensing foundation models in handling multi-modal data with different spatial resolutions. Existing remote sensing foundation models are usually dedicated to a single modality or a specific spatial resolution range, which restricts their applicability across downstream tasks. Earlier multi-modal remote sensing foundation models typically use a separate vision encoder for each modality or spatial resolution, so the backbone must be switched depending on the input data. To address this, the paper proposes OFA-Net (One-For-All Network), which uses a single shared Transformer backbone, pre-trained with masked image modeling on a curated multi-modal dataset, to process multiple data modalities with different spatial resolutions. The same backbone can then be used across different downstream tasks, a step toward a unified foundation model for Earth vision. OFA-Net was evaluated on 12 different downstream tasks and demonstrated promising performance.
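The idea of one shared backbone fed by inputs of different modalities and sizes, pre-trained with masked image modeling, can be illustrated with a toy sketch. This is not the authors' implementation: the per-modality linear patch embedding, the 0.75 mask ratio, the modality names, and every function name below are assumptions made for illustration, and `shared_backbone` is a simple stand-in for a Transformer, not a real one.

```python
import random


def patchify(image, patch):
    """Split a C x H x W nested-list image into flattened square patches."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[c][top + i][left + j]
                            for c in range(channels)
                            for i in range(patch)
                            for j in range(patch)])
    return patches


def make_embedding(in_dim, token_dim, rng):
    """Hypothetical modality-specific linear projection (in_dim -> token_dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(token_dim)]


def embed(patches, weights):
    """Project every flattened patch to the shared token dimension."""
    return [[sum(w * x for w, x in zip(row, p)) for row in weights]
            for p in patches]


def random_mask(num_tokens, mask_ratio, rng):
    """Return (visible, masked) token indices, each sorted."""
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def shared_backbone(tokens):
    """Stand-in for the shared Transformer: the same function for all inputs."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[t[d] + mean[d] for d in range(dim)] for t in tokens]


rng = random.Random(0)
TOKEN_DIM, PATCH = 8, 4  # assumed values, chosen only to keep the toy small

# Two toy inputs with different channel counts and spatial sizes.
inputs = {
    "sar": [[[0.5] * 8 for _ in range(8)] for _ in range(2)],        # 2 x 8 x 8
    "optical": [[[0.2] * 16 for _ in range(16)] for _ in range(3)],  # 3 x 16 x 16
}

encoded = {}
for name, image in inputs.items():
    patches = patchify(image, PATCH)
    weights = make_embedding(len(patches[0]), TOKEN_DIM, rng)
    tokens = embed(patches, weights)
    visible, masked = random_mask(len(tokens), 0.75, rng)
    # Only visible tokens pass through the shared backbone; in masked image
    # modeling the masked patches would become reconstruction targets.
    encoded[name] = (shared_backbone([tokens[i] for i in visible]),
                     visible, masked)
```

Both inputs end up as sequences of tokens with the same dimension, so the same backbone function handles them without switching encoders; only the sequence length differs with image size.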

Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation

Generative ConvNet Foundation Model With Sparse Modeling and Low-Frequency Reconstruction for Remote Sensing Image Interpretation

One to Transfer All: A Universal Transfer Framework for Vision Foundation Model with Few Data

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Toward Foundation Models for Earth Monitoring: Proposal for a Climate Change Benchmark

One for All: A Mutual Enhancement Method for Object Detection and Semantic Segmentation

Enabling Foundation Models: A Distributed Collaboration Framework Based on Graph Federated Learning

Foundation Models for Remote Sensing and Earth Observation: A Survey

A Billion-scale Foundation Model for Remote Sensing Images

OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery

Transformer-Based Few-Shot Object Detection with Multi-Relation Matching for Remote Sensing Images

Once-for-All: Train One Network and Specialize it for Efficient Deployment

SatVision-TOA: A Geospatial Foundation Model for Coarse-Resolution All-Sky Remote Sensing Imagery

RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks

Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI

A Transformer and Visual Foundation Model-Based Method for Cross-View Remote Sensing Image Retrieval

On the Opportunities and Challenges of Foundation Models for GeoAI (Vision Paper)

Foundation Models for Generalist Geospatial Artificial Intelligence

ViM: Vision Middleware for Unified Downstream Transferring