Multimodal Supervised Contrastive Learning in Remote Sensing Downstream Tasks

Paul Berg,Baki Uzun,Minh-Tan Pham,Nicolas Courty
DOI: https://doi.org/10.1109/lgrs.2024.3385995
IF: 5.343
2024-04-19
IEEE Geoscience and Remote Sensing Letters
Abstract:To leverage the large amount of unlabeled data available in remote sensing datasets, self-supervised learning (SSL) methods have recently emerged as an ubiquitous tool to pretrain robust image encoder models from unlabeled images. However, when used in a downstream setting, these models often need to be fine-tuned for a specific task after their pretraining. This fine-tuning still requires labeling information to train a classifier on top of the encoder while also updating the encoder weights. In this letter, we investigate the specific task of multimodal scene classification where a sample is composed of multiple views from multiple heterogeneous satellite sensors. We propose a method to improve the categorical cross-entropy fine-tuning process which is often used to specify the model for this downstream task. Our approach, based on the supervised contrastive (SupCon) learning, uses the label information available to train an image encoder in a contrastive manner from multiple modalities while also training the task-specific classifier online. Such a multimodal SupCon loss helps better align representations from samples coming from multiple sensors but having the same class labels, thus improving the performance of the fine-tuning process. Our experiments on two public datasets including DFC2020 and Meter-ML with Sentinel-1/Sentinel-2 images show a significant gain over the baseline multimodal cross-entropy loss. All codes and datasets will be made publicly available for reproducibility upon acceptance.
imaging science & photographic technology,remote sensing,engineering, electrical & electronic,geochemistry & geophysics
What problem does this paper attempt to address?