Abstract: Remote sensing image scene classification assigns semantic categories to image areas that cover multiple land cover types, reflecting the spatial aggregation of related social resources among ground objects. It is one of the more challenging remote sensing interpretation tasks because it requires the algorithm to understand the image. Extracting scene semantics from images with deep neural networks is currently an active research direction. Compared with other algorithms, deep neural networks capture semantic information in images better and therefore achieve higher classification accuracy in applications such as urban planning. In recent years, multi-modal models, typified by image-text models, have achieved satisfactory performance in downstream tasks. Introducing "multi-modal" methods into remote sensing research should not be limited to using multi-source data; more important is the encoding of diverse data and the deep features extracted from large amounts of data. In this paper, building on an image-text matching model, we therefore establish a multi-modal scene classification model (Fig. 1) for high spatial resolution aerial images, in which image features dominate and text assists the representation of image features. The algorithm first applies self-supervised learning to the visual model to align the feature domain learned from natural images with that of our dataset, which improves the visual model's feature extraction on aerial survey images. The features produced by the pre-trained image encoder and text encoder are then further aligned, and some of the image encoder's parameters are iteratively updated during training. A classifier at the end of the model performs the scene classification task.
Experiments show that our algorithm significantly improves scene classification on aerial survey images compared with single-modal visual models. The proposed model achieved precision and recall above 90% on the test set of the 27-category high spatial resolution aerial survey image dataset that we built (Fig. 2).

Fig 1. Diagram of the proposed model structure. Blue boxes are associated with the image, green boxes with the text, and red boxes with both image and text.

Fig 2. Samples in our high spatial resolution aerial survey images dataset.
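
The abstract says that image and text features are aligned, but it does not state which objective is used. As an illustration only, the sketch below assumes a CLIP-style symmetric contrastive loss over a batch of matched image-text pairs. The function names, the temperature value of 0.07, and the toy feature vectors are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i (the matched pair)."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def alignment_loss(image_feats, text_feats, temperature=0.07):
    """Assumed objective: symmetric contrastive loss between image and text features.

    image_feats[i] and text_feats[i] are treated as a matched pair; every other
    combination in the batch is treated as a mismatch.
    """
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(v) for v in text_feats]
    # Temperature-scaled cosine similarity between every image and every text.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy features (hypothetical): two images and their two text descriptions.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[2.0, 0.0], [0.0, 3.0]]
swapped_text = [[0.0, 3.0], [2.0, 0.0]]

aligned = alignment_loss(image_feats, matched_text)
misaligned = alignment_loss(image_feats, swapped_text)
print(aligned, misaligned)
```

With these toy vectors the loss is lower when each image is paired with its own text than when the texts are swapped, which is the behaviour an alignment objective is meant to reward. In a real model the features would come from the image and text encoders, and the gradient of this loss would update the trainable encoder parameters.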

A Multi-Modal High Spatial Resolution Aerial Imagery Scene Classification Model with Visual Enhancement
