Abstract: Remote sensing image scene classification assigns semantic categories to image areas that cover multiple land cover types, reflecting the spatial aggregation of related social resources among ground objects. It is among the more challenging remote sensing interpretation tasks because it requires algorithms to understand image content. Extracting scene semantic information from images with deep neural networks is currently an active research direction. Compared with other algorithms, deep neural networks capture semantic information in images better and thus achieve higher classification accuracy in applications such as urban planning. In recent years, multi-modal models, typified by image-text models, have achieved satisfactory performance in downstream tasks. In remote sensing research, "multi-modal" should not be limited to the use of multi-source data; more important are the encoding of diverse data and the deep features extracted from very large data volumes. Therefore, building on an image-text matching model, we establish a multi-modal scene classification model (Fig. 1) for high spatial resolution aerial images in which image features dominate and text facilitates the representation of image features. The algorithm first applies self-supervised learning to the visual model to align the feature domain learned from natural images with that of our dataset, which improves the effectiveness of feature extraction from aerial survey images. The features generated by the pre-trained image encoder and the text encoder are then further aligned, and some of the image encoder's parameters are iteratively updated during training. A classifier at the end of the model performs the scene classification task.
Experiments show that our algorithm significantly improves scene classification on aerial survey images compared with vision-only models. The proposed model achieves precision and recall above 90% on the test set of the 27-category high spatial resolution aerial survey image dataset that we built (Fig. 2).

Fig. 1. Diagram of the proposed model structure. Blue boxes are associated with the image, green boxes with the text, and red boxes with both image and text.

Fig. 2. Samples from our high spatial resolution aerial survey image dataset.
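
The abstract does not specify how image and text features are aligned. As a minimal sketch only, the following assumes a symmetric contrastive objective of the kind commonly used in image-text matching models: matched image-text pairs in a batch are pulled together and mismatched pairs are pushed apart. The function names, the temperature value of 0.07, and the toy feature vectors are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def alignment_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired features.

    Assumption: image_feats[i] and text_feats[i] describe the same scene.
    The temperature is an illustrative value, not one from the paper.
    """
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy batch: two image features and two text features (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
matched = alignment_loss(images, [[0.9, 0.1], [0.1, 0.9]])
swapped = alignment_loss(images, [[0.1, 0.9], [0.9, 0.1]])
print(matched < swapped)  # correctly paired texts give the lower loss
```

In a full model, the gradient of such a loss would update only the trainable part of the image encoder, and the aligned image features would then be passed to the classifier.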

Improved Visual Vocabularies for Scene Classification of High Resolution Remote Sensing Imagery in Urban Areas

Visual Vocabulary Optimization with Spatial Context for Image Annotation and Classification

Bag-of-visual-words and Spatial Extensions for Land-Use Classification

Visual Words Refining Exploiting Spatial Co-Occurrence Table

Randomized Locality Sensitive Vocabularies For Bag-Of-Features Model

Massive-Scale Visual Information Retrieval towards City Residential Environment Surveillance

An Image Classification Method Based on Multiple Visual Dictionaries

Semantic and Spatial Co-Occurrence Analysis on Object Pairs for Urban Scene Classification

Modeling spatial and semantic cues for large-scale near-duplicated image retrieval

Scene Classification Using Multi-Scale Deeply Described Visual Words

Vocabulary Hierarchy Optimisation Based on Spatial Context and Category Information

Auto‐encoder‐based Shared Mid‐level Visual Dictionary Learning for Scene Classification Using Very High Resolution Remote Sensing Images

Evaluating Bag-of-visual-words Representations in Scene Classification

Exploring Spatial Correlation for Visual Object Retrieval

Building Descriptive and Discriminative Visual Codebook for Large-Scale Image Applications

Generating descriptive visual words and visual phrases for large-scale image applications

High-Resolution Remote Sensing Image Classification with RmRMR-Enhanced Bag of Visual Words

Large Visual Words For Large Scale Image Classification

A Multi-Modal High Spatial Resolution Aerial Imagery Scene Classification Model with Visual Enhancement

Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

Urban Scene Classification with VHR Images