Abstract: Scene understanding of remote sensing images is of great significance in various applications. Its fundamental problem is how to construct representative features. Various convolutional neural network (CNN) architectures have been proposed for automatically learning features from images. However, is it right to configure the same architecture to learn all the data while ignoring the differences between images? This seems contrary to intuition: some images are clearly easier to recognize than others. The underlying problem is the gap between the characteristics of the images and the features learned by specific network structures. Unfortunately, the literature so far lacks an analysis of the relationship between the two. In this paper, we explore this problem from three aspects. First, we build a visual-based pipeline for evaluating scene complexity to characterize the intrinsic differences between images. Second, we analyze the relationship between semantic concepts and feature representations, i.e., the scalability and hierarchy of features, which are the essential elements of CNNs with different architectures, for remote sensing scenes of different complexity. Third, we introduce CAM (class activation mapping), a visualization method that explains feature learning within neural networks, to analyze the relationship between scenes of different complexity and semantic feature representations. The experimental results show that a complex scene needs deeper and multi-scale features, whereas a simpler scene needs shallower and single-scale features. In addition, the concept of a complex scene depends more on the joint semantic representation of multiple objects. Furthermore, we propose a framework for predicting the scene complexity of an image and use it to design a depth- and scale-adaptive model. It achieves higher performance with fewer parameters than the original model, demonstrating the potential significance of scene complexity.

A Single-Stream Adaptive Scene Layout Modeling Method for Scene Recognition

Modeling Spatial Layout for Scene Image Understanding Via a Novel Multiscale Sum-Product Network

LA-Net: Layout-Aware Dense Network for Monocular Depth Estimation

Scene Recognition by Manifold Regularized Deep Learning Architecture

A new representation of scene layout improves saliency detection in traffic scenes

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling

A holistic representation guided attention network for scene text recognition

Locally Supervised Deep Hybrid Model for Scene Recognition

Double-Keggin-type Anion-templated Synthesis of a 3D Porous Coordination Polymer of Eu(III) Ions and dpdo Ligands

Reconfigurable models for scene recognition

Self-Selection Salient Region-Based Scene Recognition Using Slight-Weight Convolutional Neural Network

Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling

Superpixel Segmentation Based Structural Scene Recognition

SRRM: Semantic Region Relation Model for Indoor Scene Recognition

Compositional scene modeling with global object-centric representations

Scene Complexity: A New Perspective on Understanding the Scene Semantics of Remote Sensing and Designing Image-Adaptive Convolutional Neural Networks

Attention Pyramid Module for Scene Recognition

Joint learning of video scene detection and annotation via multi-modal adaptive context network

Saliency Prediction with Scene Structural Guidance

Joint global metric learning and local manifold preservation for scene recognition