Abstract: Generating topographic classification maps and relative height maps from aerial or remote sensing images is a crucial research task in remote sensing. Applications ranging from autonomous driving, three-dimensional city modeling, road design, and resource statistics to smart cities all require both the relative heights and the classes of objects. However, most current methods for acquiring relative height data rely on multiple images. In recent years, the rapid development of artificial intelligence has made it possible to estimate relative height from a single image: a network learns, in a data-driven manner, implicit mappings that may not be explicitly available through mathematical modeling. We further find that relative height data and geographic classification data can assist each other through their data distributions. On this basis, we propose a unified deep learning architecture that generates both an estimated relative height map and a semantic segmentation map and is trained end to end. Unlike existing methods, our model performs relative height estimation and semantic segmentation simultaneously, so a single image suffices to obtain both outputs. The model performs much better than models of comparable computational cost. We also design dynamic weights that allow the model to learn relative height estimation and semantic segmentation at the same time. We evaluate the approach experimentally on existing datasets. The results show that the proposed Transformer-based network architecture is well suited to relative height estimation and substantially outperforms other state-of-the-art deep learning (DL) methods.
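The abstract does not say how the dynamic weights are computed. As a purely illustrative sketch, the code below uses Dynamic Weight Averaging, a well-known scheme swapped in here as an assumption and not confirmed to be the authors' method. It gives a larger weight to the task whose loss has been falling more slowly. The function names, the temperature value, and the loss numbers are all hypothetical.

```python
import math


def dynamic_task_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per task; the weights sum to the number of tasks.

    Assumption: Dynamic Weight Averaging, not necessarily the paper's scheme.
    """
    # Ratio of the last two losses per task: a larger ratio means the
    # task's loss is decreasing more slowly (or increasing).
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    # Softmax over the ratios, softened by the temperature.
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    num_tasks = len(ratios)
    return [num_tasks * e / total for e in exps]


def combined_loss(task_losses, weights):
    """Weighted sum of the per-task losses."""
    return sum(w * loss for w, loss in zip(weights, task_losses))


# Hypothetical losses, ordered (height estimation, semantic segmentation).
weights = dynamic_task_weights(
    prev_losses=[0.8, 0.5],
    prev_prev_losses=[1.0, 1.0],
)
total = combined_loss([0.7, 0.45], weights)
```

In this example the height loss fell from 1.0 to 0.8 while the segmentation loss fell from 1.0 to 0.5, so the height task receives the larger weight on the next step. A higher temperature pushes the weights closer to equal.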

Joint Task-Recursive Learning for RGB-D Scene Understanding

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion

Joint-Confidence-Guided Multi-Task Learning for 3D Reconstruction and Understanding from Monocular Camera

Semantic Reconstruction based on RGB Image and Sparse Depth

Collaborative Learning of Depth Estimation, Visual Odometry and Camera Relocalization from Monocular Videos

Multi-branch Collaborative Learning Network for 3D Visual Grounding

Self-supervised Recurrent Visual Odometry, Depth Estimation, and Instance Segmentation

Multi-Task Learning of Relative Height Estimation and Semantic Segmentation from Single Airborne RGB Images

Multi-resolution Cascaded Network with Depth-similar Residual Module for Real-time Semantic Segmentation on RGB-D Images

Simultaneous Semantic Segmentation and Depth Completion with Constraint of Boundary

To Complete or to Estimate, That is the Question: A Multi-Task Approach to Depth Completion and Monocular Depth Estimation

LRRU: Long-short Range Recurrent Updating Networks for Depth Completion

Cross-Dimensional Refined Learning for Real-Time 3D Visual Perception from Monocular Video

RGB-Fusion: Monocular 3D reconstruction with learned depth prediction

Joint Object Segmentation and Depth Upsampling

3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose From Monocular Video

Recursive noisy label learning paradigm based on confidence measurement for semi-supervised depth completion

TCANet: three-stream coordinate attention network for RGB-D indoor semantic segmentation

Progressive Recurrent Learning for Visual Recognition

Learning Monocular Depth in Dynamic Environment via Context-aware Temporal Attention