OReole-FM: successes and challenges toward billion-parameter foundation models for high-resolution satellite imagery

Philipe Dias, Aristeidis Tsaris, Jordan Bowman, Abhishek Potnis, Jacob Arndt, H. Lexie Yang, Dalton Lunga
DOI: https://doi.org/10.1145/3678717.3691292
2024-10-26
Abstract: While the pretraining of Foundation Models (FMs) for remote sensing (RS) imagery is on the rise, models remain restricted to a few hundred million parameters. Scaling models to billions of parameters has been shown to yield unprecedented benefits including emergent abilities, but requires data scaling and computing resources typically not available outside industry R&D labs. In this work, we pair high-performance computing resources including Frontier supercomputer, America's first exascale system, and high-resolution optical RS data to pretrain billion-scale FMs. Our study assesses performance of different pretrained variants of vision Transformers across image classification, semantic segmentation and object detection benchmarks, which highlight the importance of data scaling for effective model scaling. Moreover, we discuss construction of a novel TIU pretraining dataset, model initialization, with data and pretrained models intended for public release. By discussing technical challenges and details often lacking in the related literature, this work is intended to offer best practices to the geospatial community toward efficient training and benchmarking of larger FMs.
Computer Vision and Pattern Recognition; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the following problems:

1. **Applying large-scale models to remote sensing imagery**: Foundation models (FMs) for remote sensing imagery are currently limited to a few hundred million parameters, whereas models with billions of parameters have shown significant advantages in natural language processing and computer vision, including emergent abilities. The paper explores how to bring these advantages to the analysis of high-resolution satellite imagery.
2. **Data and computing resource challenges**: Training billion-parameter models requires data at scale and high-performance computing resources that are typically unavailable outside industry R&D labs. Using Frontier, the first exascale supercomputer in the United States, the paper explores how to overcome these challenges and offers practical guidance.
3. **Model architecture and pretraining strategies**: The paper pretrains Vision Transformer (ViT) variants of different sizes and evaluates them on image classification, semantic segmentation, and object detection benchmarks. It also discusses technical details of model initialization, construction of the TIU pretraining dataset, and pretraining strategies.
4. **Generalization and label efficiency**: An important advantage of large-scale models is their ability to generalize across tasks with less labeled data. The paper tests these advantages experimentally and discusses how to further improve generalization and label efficiency.
5. **Standardized benchmarking**: The paper points out that many existing studies omit the details needed for reproducibility, especially in the evaluation of downstream tasks. It therefore emphasizes the importance of standardized benchmarking and offers specific suggestions.

In summary, the paper explores the potential of billion-parameter models for high-resolution satellite image analysis by training and evaluating them, and it provides technical details and best practices intended to support further work in this field.
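To make the gap between "a few hundred million" and "billions" of parameters concrete, the following is a rough Python sketch that counts the parameters of a plain ViT encoder from its depth, width, and MLP size. The function name `vit_param_count` is hypothetical, and the configurations are the standard Base, Large, and Huge settings from the general ViT literature plus an assumed giant-style setting. They are not confirmed to be the variants pretrained in this paper. The count assumes a 224x224 RGB input, 16x16 patches, one class token, learned position embeddings, and no classification head.

```python
def vit_param_count(depth, width, mlp_dim, patch=16, image=224, channels=3):
    """Approximate parameter count of a plain ViT encoder (no classification head)."""
    tokens = (image // patch) ** 2 + 1                      # patch tokens plus one class token
    patch_embed = patch * patch * channels * width + width  # linear patch projection and bias
    pos_embed = tokens * width                              # learned position embeddings
    cls_token = width
    attention = 4 * width * width + 4 * width               # Q, K, V, output projections and biases
    mlp = 2 * width * mlp_dim + mlp_dim + width             # two linear layers and biases
    layer_norms = 2 * 2 * width                             # two LayerNorms per block (scale, shift)
    block = attention + mlp + layer_norms
    final_norm = 2 * width
    return patch_embed + pos_embed + cls_token + depth * block + final_norm


# (depth, width, mlp_dim); illustrative settings, assumed rather than taken from the paper
CONFIGS = {
    "ViT-Base": (12, 768, 3072),
    "ViT-Large": (24, 1024, 4096),
    "ViT-Huge": (32, 1280, 5120),
    "giant-style (assumed)": (48, 1664, 8192),
}

for name, (depth, width, mlp_dim) in CONFIGS.items():
    millions = vit_param_count(depth, width, mlp_dim) / 1e6
    print(f"{name}: about {millions:.0f}M parameters")
```

Under these assumptions, the Base and Large settings come to roughly 86M and 303M parameters, while the assumed giant-style setting exceeds one billion. Because the per-block cost grows with the square of the width, widening and deepening the encoder together is what moves a model from the hundred-million range into the billion range.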