One for All: Toward Unified Foundation Models for Earth Vision

Zhitong Xiong, Yi Wang, Fahong Zhang, Xiao Xiang Zhu
2024-05-28
Abstract: Foundation models characterized by extensive parameters and trained on large-scale datasets have demonstrated remarkable efficacy across various downstream tasks for remote sensing data. Current remote sensing foundation models typically specialize in a single modality or a specific spatial resolution range, limiting their versatility for downstream datasets. While there have been attempts to develop multi-modal remote sensing foundation models, they typically employ separate vision encoders for each modality or spatial resolution, necessitating a switch in backbones contingent upon the input data. To address this issue, we introduce a simple yet effective method, termed OFA-Net (One-For-All Network): employing a single, shared Transformer backbone for multiple data modalities with different spatial resolutions. Using the masked image modeling mechanism, we pre-train a single Transformer backbone on a curated multi-modal dataset with this simple design. Then the backbone model can be used in different downstream tasks, thus forging a path towards a unified foundation backbone model in Earth vision. The proposed method is evaluated on 12 distinct downstream tasks and demonstrates promising performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limitations of current remote sensing foundation models when handling multi-modal data at different spatial resolutions. Existing remote sensing foundation models are usually dedicated to a single modality or a specific spatial resolution range, which restricts their applicability across downstream tasks. Earlier multi-modal remote sensing foundation models typically use a separate vision encoder for each modality or spatial resolution, so the backbone has to be switched depending on the input data, which reduces flexibility and operational efficiency. To address this, the paper proposes OFA-Net (One-For-All Network), which uses a single shared Transformer backbone, pre-trained with masked image modeling on a curated multi-modal dataset, to process multiple data modalities at different spatial resolutions. This simplifies the model design and lets the same backbone be reused across downstream tasks. OFA-Net was evaluated on 12 distinct downstream tasks and showed promising performance.
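The data flow described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes, hypothetically, that each modality has its own linear patch embedding mapping patches into a common token width, after which one shared function processes every modality. The modality names, channel counts, patch size, embedding width, and mask ratio are all illustrative values, and `shared_backbone` is a placeholder rather than a real Transformer. The decoder and reconstruction loss used in masked image modeling are omitted.

```python
import random

EMBED_DIM = 8      # shared token width (assumed value)
PATCH = 4          # patch size in pixels (assumed value)
MASK_RATIO = 0.75  # fraction of tokens hidden during pre-training (assumed value)

def patchify(image, patch):
    """Split a C x H x W nested-list image into flattened, non-overlapping patches."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[c][top + i][left + j]
                            for c in range(channels)
                            for i in range(patch)
                            for j in range(patch)])
    return patches

def make_patch_embedding(in_dim, out_dim, rng):
    """Hypothetical modality-specific linear projection with random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def embed(patches, weights):
    """Project every flattened patch to a token of length len(weights)."""
    return [[sum(w * x for w, x in zip(row, p)) for row in weights] for p in patches]

def random_mask(num_tokens, mask_ratio, rng):
    """Return (visible, masked) token indices for masked image modeling."""
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(order[num_masked:]), sorted(order[:num_masked])

def shared_backbone(tokens):
    """Placeholder for the single shared Transformer: it only centers each token."""
    return [[x - sum(tok) / len(tok) for x in tok] for tok in tokens]

rng = random.Random(0)
modalities = {"sar": 2, "rgb": 3, "multispectral": 4}  # hypothetical channel counts
results = {}
for name, channels in modalities.items():
    # Random 8 x 8 image standing in for real sensor data.
    image = [[[rng.random() for _ in range(8)] for _ in range(8)]
             for _ in range(channels)]
    patches = patchify(image, PATCH)
    weights = make_patch_embedding(channels * PATCH * PATCH, EMBED_DIM, rng)
    tokens = embed(patches, weights)
    visible, masked = random_mask(len(tokens), MASK_RATIO, rng)
    # The same backbone function is applied regardless of modality.
    encoded = shared_backbone([tokens[i] for i in visible])
    results[name] = {"visible": visible, "masked": masked, "encoded": encoded}
    print(name, "tokens:", len(tokens), "visible:", visible, "masked:", masked)
```

Inputs with different channel counts all end up as tokens of the same width, so no backbone switch is needed between modalities. Only the small embedding step differs in this sketch.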