In-Domain Self-Supervised Learning Improves Remote Sensing Image Scene Classification

Ivica Dimitrovski, Ivan Kitanovski, Nikola Simidjievski, Dragi Kocev
DOI: https://doi.org/10.1109/LGRS.2024.3352926
2024-02-05
Abstract: We investigate the utility of in-domain self-supervised pre-training of vision models in the analysis of remote sensing imagery. Self-supervised learning (SSL) has emerged as a promising approach for remote sensing image classification due to its ability to exploit large amounts of unlabeled data. Unlike traditional supervised learning, SSL aims to learn representations of data without the need for explicit labels. This is achieved by formulating auxiliary tasks that can be used for pre-training models before fine-tuning them on a given downstream task. A common approach in practice to SSL pre-training is utilizing standard pre-training datasets, such as ImageNet. While relevant, such a general approach can have a sub-optimal influence on the downstream performance of models, especially on tasks from challenging domains such as remote sensing. In this paper, we analyze the effectiveness of SSL pre-training by employing the iBOT framework coupled with Vision transformers trained on Million-AID, a large and unlabeled remote sensing dataset. We present a comprehensive study of different self-supervised pre-training strategies and evaluate their effect across 14 downstream datasets with diverse properties. Our results demonstrate that leveraging large in-domain datasets for self-supervised pre-training consistently leads to improved predictive downstream performance, compared to the standard approaches found in practice.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks whether in-domain self-supervised learning (SSL) improves model performance on downstream tasks in remote sensing image analysis. Specifically, it examines whether self-supervised pre-training on a large unlabeled remote sensing dataset yields better remote sensing image scene classification results than the common practice of pre-training on general-purpose datasets such as ImageNet.

### Background and Problem of the Paper

With the growing abundance of remote sensing data and advances in artificial intelligence, especially computer vision, deep learning models are increasingly used in remote sensing image analysis and have achieved strong results. However, for some highly relevant tasks (such as archaeological site identification, pasture grazing management, and agricultural fertilization), progress is limited by the lack of large, public, labeled datasets. Labeling large datasets is usually expensive, tedious, time-consuming, and largely manual. Researchers have therefore turned to self-supervised learning (SSL) to reduce the dependence on labeled data while maintaining model performance.

### Research Objectives

The main research question of the paper is: **Can in-domain self-supervised learning consistently improve performance on remote sensing tasks?** To answer it, the authors conducted a comprehensive experimental study on 14 downstream datasets that differ in number of images, spatial resolution, number of labels, and distribution.

### Methods and Materials

1. **Self-supervised Pre-training Framework**: The authors selected the iBOT framework combined with a Vision Transformer (ViT) as the base model for self-supervised pre-training. iBOT pre-trains the model by combining Masked Image Modeling (MIM) with self-distillation.
2. **Unlabeled Dataset**: The Million-AID dataset was used. It contains 1,000,848 non-overlapping remote sensing scene images, with image sizes ranging from 110×110 to 31,672×31,672 pixels.
3. **Downstream Tasks**: The evaluation focuses on image scene classification, covering both Multi-Class Classification (MCC) and Multi-Label Classification (MLC). The datasets include EuroSAT, UC Merced, and AID, among others.

### Experimental Setup

1. **Pre-training Strategies**:
   - In-domain self-supervised pre-training with the iBOT framework on the Million-AID dataset.
   - Standard self-supervised or fully supervised pre-training on the ImageNet-1K dataset.
2. **Downstream Task Evaluation**:
   - The pre-trained models are applied to the 14 downstream datasets using two strategies: linear probing and fine-tuning.
3. **Evaluation Metrics**:
   - For multi-class classification tasks, accuracy is used as the evaluation metric.
   - For multi-label classification tasks, macro-averaged mean average precision (mAP) is used as the evaluation metric.

### Results

The experimental results show that models pre-trained with in-domain self-supervised learning improve performance across all downstream tasks. Compared with models pre-trained on ImageNet, the in-domain self-supervised pre-trained models improve by about 1% on average in multi-class classification tasks and by about 2% in multi-label classification tasks.
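The two metrics behind these numbers, accuracy for multi-class tasks and macro-averaged mAP for multi-label tasks, can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' evaluation code: the function names are hypothetical, the toy labels and scores are made up, and ties between scores are broken by input order, which may differ from library implementations.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def average_precision(relevant: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one label: mean of precision@k over ranks k holding a positive."""
    # Rank samples from highest to lowest score (stable for equal scores).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank
    # Assumption: a label with no positive samples contributes an AP of 0.
    return precision_sum / hits if hits else 0.0


def macro_map(y_true: Sequence[Sequence[int]],
              y_score: Sequence[Sequence[float]]) -> float:
    """Macro average: compute AP per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    aps = [
        average_precision([row[j] for row in y_true],
                          [row[j] for row in y_score])
        for j in range(n_labels)
    ]
    return sum(aps) / n_labels


if __name__ == "__main__":
    # Toy multi-class example: 4 samples, predicted vs. true class index.
    print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))
    # Toy multi-label example: 3 samples, 2 labels, scores in [0, 1].
    labels = [[1, 0], [0, 1], [1, 1]]
    scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6]]
    print(macro_map(labels, scores))
```

Because the macro average weights every label equally, rare labels influence the score as much as frequent ones.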
In addition, a fine-grained analysis shows that the in-domain self-supervised pre-trained model performs particularly well on sparsely represented labels and can more accurately identify specific objects in the image.

### Conclusions

Through a comprehensive experimental study, the paper shows that pre-training with in-domain self-supervised learning improves remote sensing image scene classification.
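To make the Masked Image Modeling step named under Methods and Materials more concrete, the sketch below shows how an image can be split into patch tokens and how a subset of them can be hidden before being passed to a model. It is an illustration under stated assumptions, not iBOT's implementation: the image size (224), patch size (16), and mask ratio (0.3) are illustrative values, uniform random masking is chosen here for brevity, and all function names are hypothetical. The summary above does not describe the framework's actual masking schedule.

```python
import random
from typing import List, Sequence


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def random_patch_mask(n_patches: int, mask_ratio: float,
                      rng: random.Random) -> List[bool]:
    """Return a boolean mask; True marks a patch hidden from the model.

    Assumption: patches are masked uniformly at random.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    n_masked = round(n_patches * mask_ratio)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [i in masked for i in range(n_patches)]


def apply_mask(tokens: Sequence[str], mask: Sequence[bool],
               mask_token: str = "[MASK]") -> List[str]:
    """Replace masked patch tokens with a placeholder mask token."""
    return [mask_token if m else t for t, m in zip(tokens, mask)]


if __name__ == "__main__":
    rng = random.Random(0)              # fixed seed for reproducibility
    n = num_patches(224, 16)            # illustrative ViT-style patch grid
    mask = random_patch_mask(n, 0.3, rng)
    tokens = [f"patch_{i}" for i in range(n)]
    masked_tokens = apply_mask(tokens, mask)
    print(n, sum(mask), masked_tokens[:5])
```

In a MIM objective, the model is trained to predict information about the masked positions from the visible ones. No labels are needed for this, which is what allows pre-training on an unlabeled dataset such as Million-AID.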