Abstract: Although deep learning has revolutionized remote sensing (RS) image scene classification, current deep learning-based approaches depend heavily on large-scale supervision over predetermined scene categories and perform poorly on new categories outside that predetermined set. In practice, the classification task often has to be extended as new applications emerge, and these inevitably involve new categories of RS image scenes. It is therefore important to give the deep learning model the inference ability to recognize RS image scenes from unseen categories, that is, categories that do not overlap with the scene categories used in the training stage. By fully exploiting RS domain characteristics, this paper constructs a new remote sensing knowledge graph (RSKG) from scratch to support the inference-based recognition of unseen RS image scenes. To improve the semantic representation of RS-oriented scene categories, this paper proposes generating a Semantic Representation of scene categories through representation learning on the RSKG (SR-RSKG). To achieve robust cross-modal matching between visual features and semantic representations, this paper proposes a novel deep alignment network (DAN) with a series of well-designed optimization constraints, which can simultaneously address zero-shot and generalized zero-shot RS image scene classification. Extensive experiments on a merged RS image scene dataset, which integrates multiple publicly available datasets, show that the proposed SR-RSKG clearly outperforms traditional knowledge types (e.g., natural language processing models and manually annotated attribute vectors), and that the proposed DAN outperforms state-of-the-art methods under both the zero-shot and generalized zero-shot RS image scene classification settings.
The constructed RSKG will be made publicly available along with this paper (https://github.com/kdy2021/SR-RSKG).
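
The difference between the zero-shot and generalized zero-shot settings named in the abstract can be illustrated with a minimal nearest-prototype sketch. This is not the paper's DAN: it assumes that visual features and class semantic representations have already been mapped into a shared space, and it simply picks the most similar class by cosine similarity. The class names, the 3-D vectors, and the helpers `cosine` and `classify` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(visual_embedding, class_embeddings, candidates):
    """Return the candidate class whose semantic embedding is most similar."""
    return max(candidates, key=lambda c: cosine(visual_embedding, class_embeddings[c]))


# Hypothetical semantic embeddings in a shared space (toy numbers, not from the paper).
class_embeddings = {
    "airport": [1.0, 0.0, 0.0],  # seen during training
    "forest": [0.0, 1.0, 0.0],   # seen during training
    "harbor": [0.0, 0.0, 1.0],   # unseen during training
    "desert": [0.0, 0.7, 0.7],   # unseen during training
}
seen = ["airport", "forest"]
unseen = ["harbor", "desert"]

# Stand-in for an already-aligned visual feature of a test image.
image = [0.2, 0.1, 0.9]

# Zero-shot setting: the label space contains only unseen classes.
zsl_prediction = classify(image, class_embeddings, unseen)

# Generalized zero-shot setting: the label space contains seen and unseen classes.
gzsl_prediction = classify(image, class_embeddings, seen + unseen)

print(zsl_prediction, gzsl_prediction)
```

The only difference between the two calls is the candidate label set. The generalized setting is harder in practice because a model trained only on seen categories tends to favor them at test time.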

Mining Contrastive Relations Between Cross-Modal Features for Zero-Shot Remote Sensing Image Scene Classification

Integrating Adversarial Generative Network with Variational Autoencoders Towards Cross-Modal Alignment for Zero-Shot Remote Sensing Image Scene Classification

Zero-Shot Scene Classification for High Spatial Resolution Remote Sensing Images

Text guided zero-shot scene classification of high spatial resolution remote sensing images

See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data

Transformer-Based Approach Via Contrastive Learning for Zero-Shot Detection

Cross-Domain Few-Shot Hyperspectral Image Classification With Cross-Modal Alignment and Supervised Contrastive Learning

Deep Semantic-Visual Alignment for Zero-Shot Remote Sensing Image Scene Classification

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

Few-Shot Remote Sensing Scene Classification With Spatial Affinity Attention and Class Surrogate-Based Supervised Contrastive Learning

Multi-level Relation Learning for Cross-domain Few-shot Hyperspectral Image Classification

Contrastive Constrained Cross-Scene Model-Informed Interpretable Classification Strategy for Hyperspectral and LiDAR Data

Attention-Based Contrastive Learning for Few-Shot Remote Sensing Image Classification

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

Robust deep alignment network with remote sensing knowledge graph for zero-shot and generalized zero-shot remote sensing image scene classification

Mining on Heterogeneous Manifolds for Zero-Shot Cross-Modal Image Retrieval

Feature Transformation for Cross-domain Few-shot Remote Sensing Scene Classification

Alignment and Fusion Using Distinct Sensor Data for Multimodal Aerial Scene Classification

Triplet Contrastive Learning Framework With Adversarial Hard-Negative Sample Generation for Multimodal Remote Sensing Images

Understanding Dark Scenes by Contrasting Multi-Modal Observations