Abstract: Remote sensing image scene classification assigns semantic categories to image areas that cover multiple land cover types, reflecting the spatial aggregation of related social resources among feature objects. It is one of the more challenging remote sensing interpretation tasks, because it requires algorithms to understand image content. Extracting scene semantic information from images with deep neural networks is currently an active research direction. Compared with other algorithms, deep neural networks capture semantic information in images more effectively and thus achieve higher classification accuracy in applications such as urban planning. In recent years, multi-modal models, represented by image-text models, have achieved satisfactory performance in downstream tasks. Introducing "multi-modal" approaches into remote sensing research should not be limited to using multi-source data; more important is the encoding of diverse data and the extraction of deep features from very large amounts of data. In this paper, therefore, we build on an image-text matching model to establish a multi-modal scene classification model (Fig. 1) for high spatial resolution aerial images, in which image features dominate and text facilitates the representation of image features. The algorithm first applies self-supervised learning to the visual model in order to align the feature domain learned from natural images with that of our dataset, which helps improve the visual model's feature extraction on aerial survey images. The features produced by the pre-trained image encoder and the text encoder are then further aligned, and some of the image encoder's parameters are iteratively updated during training. A classifier at the end of the model performs the scene classification task.
Experiments show that our algorithm significantly improves scene classification on aerial survey images compared with single visual models. The proposed model achieves precision and recall above 90% on the test set of the 27-category high spatial resolution aerial survey image dataset that we built (Fig. 2).

Fig. 1. Diagram of the proposed model structure. Blue boxes are associated with the image, green boxes with the text, and red boxes with both image and text.

Fig. 2. Samples from our high spatial resolution aerial survey image dataset.
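
The abstract reports precision and recall over a multi-class test set but does not state how the per-class values are aggregated. As an illustration only, the sketch below computes macro-averaged precision and recall, which is one common choice and an assumption here, not necessarily the authors' protocol. The function name `macro_precision_recall` and the three class labels are hypothetical; the paper's dataset has 27 categories that are not named in this text.

```python
from collections import Counter


def macro_precision_recall(y_true, y_pred):
    """Return (macro_precision, macro_recall) for multi-class labels.

    Assumption: unweighted (macro) averaging over every class that
    appears in either the ground truth or the predictions.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        predicted = tp[c] + fp[c]   # samples predicted as class c
        actual = tp[c] + fn[c]      # samples whose true class is c
        precisions.append(tp[c] / predicted if predicted else 0.0)
        recalls.append(tp[c] / actual if actual else 0.0)

    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical scene labels, used only to exercise the function.
y_true = ["farmland", "farmland", "forest", "forest", "river", "river"]
y_pred = ["farmland", "forest", "forest", "forest", "river", "farmland"]
precision, recall = macro_precision_recall(y_true, y_pred)
print(f"macro precision = {precision:.3f}, macro recall = {recall:.3f}")
```

Macro averaging weights every scene category equally, so a rare category counts as much as a common one; a micro or sample-weighted average would be the alternative if the class sizes in the dataset differ widely.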

Multi-modal Remote Sensing Image Description Based on Word Embedding and Self-Attention Mechanism

Remote Sensing Scene Image Classification Model Based on Multi-Scale Features and Attention Mechanism

Pixel-Level Remote Sensing Image Recognition Based on Bidirectional Word Vectors

A Multi-Modal High Spatial Resolution Aerial Imagery Scene Classification Model with Visual Enhancement

Remote Sensing Image Description Based on Word Embedding and End-to-end Deep Learning

Deep Semantic Understanding of High Resolution Remote Sensing Image

Hierarchical Self-Attention Embedded Neural Network With Dense Connection for Remote-Sensing Image Semantic Segmentation

Remote Sensing Image Scene Classification Based on Global Self-Attention Module

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Multi-Attention-Network for Semantic Segmentation of Fine Resolution Remote Sensing Images

Semantic segmentation of remote sensing images based on dual‐channel attention mechanism

Remote Sensing Time Series Classification Based on Self-Attention Mechanism and Time Sequence Enhancement

MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description

Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images

Multi-Attention-Based Semantic Segmentation Network for Land Cover Remote Sensing Images

From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing

Self-supervision assisted multimodal remote sensing image classification with coupled self-looping convolution networks

Cross‐modal retrieval with dual multi‐angle self‐attention

Multi-modal remote sensing image segmentation based on attention-driven dual-branch encoding framework

Multi-modal gated recurrent units for image description

MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis