Abstract: We introduce a novel strategy for learning to extract semantically meaningful features from aerial imagery. Instead of manually labeling the aerial imagery, we propose to predict (noisy) semantic features automatically extracted from co-located ground imagery. Our network architecture takes an aerial image as input, extracts features using a convolutional neural network, and then applies an adaptive transformation to map these features into the ground-level perspective. We use an end-to-end learning approach to minimize the difference between the semantic segmentation extracted directly from the ground image and the semantic segmentation predicted solely based on the aerial image. We show that a model learned using this strategy, with no additional training, is already capable of rough semantic labeling of aerial imagery. Furthermore, we demonstrate that by finetuning this model we can achieve more accurate semantic segmentation than two baseline initialization strategies. We use our network to address the task of estimating the geolocation and geoorientation of a ground image. Finally, we show how features extracted from an aerial image can be used to hallucinate a plausible ground-level panorama.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the problem of predicting ground scene layouts from aerial imagery. Specifically, the authors propose a novel method that leverages existing ground image semantic segmentation techniques to automatically extract semantic features from aerial images without manual annotation. This approach can significantly reduce the cost of data annotation and improve the model's generalization ability across different datasets.

### Main Contributions

1. **Novel Convolutional Neural Network (CNN) Architecture**: This architecture can relate the appearance of aerial images to the semantic layout of ground images at the same location.
2. **Value of Pre-training Strategy**: Demonstrates how to pre-train a CNN to understand aerial images using this strategy.
3. **Extension to Other Tasks**: Applies this technique to ground image localization, orientation estimation, and synthesis tasks.
4. **Evaluation on Large-scale Real-world Datasets**: Conducts extensive evaluations of each technique, validating their effectiveness in practical applications.

### Method Overview

1. **Dataset**: The authors collected approximately 1.5 million pairs of geo-tagged ground and aerial images from the CVUSA dataset for training and testing.
2. **Network Architecture**:
   - **Feature Extraction**: Uses the VGG16 architecture to extract features from aerial images and convert them into hypercolumns.
   - **Cross-view Semantic Transformation**: Transforms the semantic labels of aerial images into ground-view semantic labels through a linear operation.
   - **Transformation Matrix**: Estimates the transformation matrix using a neural network, which depends on the content and pixel position of the aerial images.
3. **Training Process**: Minimizes the difference between the semantic segmentation of ground images and the predicted semantic segmentation from aerial images using an end-to-end learning approach.

### Applications and Evaluation

1. **Weakly Supervised Learning**: Demonstrates how to perform semantic segmentation without manually annotated aerial images.
2. **Pre-training**: Evaluates the effectiveness of this method as a pre-training strategy using the ISPRS dataset, showing superior performance compared to traditional initialization methods.
3. **Geolocation**: Shows how to estimate the orientation and position of ground images using the estimated ground feature maps.
4. **Generating Ground Images**: Proposes a new application of generating ground panoramas from features extracted by the network.

### Conclusion

This paper presents a novel method that automatically extracts semantic features from aerial images by leveraging ground image semantic segmentation techniques. This method not only reduces the cost of data annotation but also improves the model's generalization ability across different datasets. Experimental results show that this method performs excellently on multiple tasks, providing new insights into the understanding of aerial images.
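The cross-view semantic transformation and the training objective summarized under Method Overview can be illustrated with a small NumPy sketch. This is not the authors' implementation: the image sizes, the random stand-in for the learned transformation matrix `M`, the choice to make `M` row-stochastic, and the cross-entropy loss are all assumptions made for illustration. In the paper, `M` is estimated by a neural network from the aerial image content and pixel position.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_transform(aerial_labels, M):
    """Map aerial per-pixel class probabilities to the ground view.

    aerial_labels: (ha, wa, k) class probabilities per aerial pixel.
    M: (hg * wg, ha * wa) transformation matrix (hypothetical shape).
    Returns an array of shape (hg * wg, k).
    """
    ha, wa, k = aerial_labels.shape
    flat = aerial_labels.reshape(ha * wa, k)
    # Linear operation: each ground pixel is a weighted sum of aerial pixels.
    return M @ flat

def cross_entropy(pred, target, eps=1e-12):
    """Mean per-pixel cross-entropy between two label distributions."""
    return float(-(target * np.log(pred + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
ha, wa = 4, 4      # aerial label map size (toy values)
hg, wg = 2, 6      # ground panorama label map size (toy values)
k = 3              # number of semantic classes (toy value)

# Stand-in for the aerial segmentation predicted by the CNN.
aerial_probs = softmax(rng.normal(size=(ha, wa, k)))

# Assumption: M is row-stochastic, so each ground pixel is a convex
# combination of aerial pixels. Here it is random, not learned.
M = softmax(rng.normal(size=(hg * wg, ha * wa)), axis=1)

ground_pred = cross_view_transform(aerial_probs, M).reshape(hg, wg, k)

# Stand-in for the (noisy) segmentation extracted from the ground image.
ground_target = softmax(rng.normal(size=(hg, wg, k)))

loss = cross_entropy(ground_pred, ground_target)
print(ground_pred.shape, loss)
```

Because each row of `M` sums to one in this sketch, every predicted ground pixel remains a valid probability distribution over classes, which keeps the loss well defined. End-to-end training would backpropagate this loss through both the matrix-estimating network and the aerial feature extractor.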

Predicting Ground-Level Scene Layout from Aerial Imagery

Scribble-Supervised Segmentation of Aerial Building Footprints Using Adversarial Learning

Aerial-PASS: Panoramic Annular Scene Segmentation in Drone Videos

Wide-Area Image Geolocalization with Aerial Reference Imagery

Efficient geospatial mapping of buildings, woodlands, water and roads from aerial imagery using deep learning

Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery

Semantic-aware Network for Aerial-to-Ground Image Synthesis

Attention GANs: Unsupervised Deep Feature Learning for Aerial Scene Classification

Multispectral Semantic Land Cover Segmentation From Aerial Imagery With Deep Encoder–Decoder Network

Semantic 3D Reconstruction with Learning MVS and 2D Segmentation of Aerial Images

Informative Path Planning for Active Learning in Aerial Semantic Mapping

Classification of Very-High-Spatial-Resolution Aerial Images Based on Multiscale Features with Limited Semantic Information

Local Semantic Enhanced ConvNet for Aerial Scene Recognition

Edge-Semantic Learning Strategy for Layout Estimation in Indoor Environment

Aerial-DEM Geolocalization for GPS-Denied UAS Navigation

Contextual Hourglass Network for Semantic Segmentation of High Resolution Aerial Imagery

An Aerial Image Segmentation Approach Based on Enhanced Multi-Scale Convolutional Neural Network

Multi-Task Learning of Height and Semantics from Aerial Images

Terrain-Informed Self-Supervised Learning: Enhancing Building Footprint Extraction from LiDAR Data with Limited Annotations

Bottom-up Estimation of Geometric Layout for Indoor Images

Predicting Vegetation Stratum Occupancy from Airborne LiDAR Data with Deep Learning