Abstract: Many computer vision tasks, such as monocular depth estimation and height estimation from a satellite orthophoto, share a common underlying goal: regressing dense continuous values for the pixels of a single image. We define them as dense continuous-value regression (DCR) tasks. Recent approaches based on deep convolutional neural networks have significantly improved performance on DCR tasks, particularly pixelwise regression accuracy. However, it remains challenging to simultaneously preserve the global structure and fine object details in complex scenes. In this article, we exploit the efficiency of the Laplacian pyramid in representing multiscale content to reconstruct high-quality signals for complex scenes. We design a Laplacian pyramid neural network (LAPNet), which consists of a Laplacian pyramid decoder (LPD) for signal reconstruction and an adaptive dense feature fusion (ADFF) module to fuse features from the input image. More specifically, we build an LPD to effectively express both global and local scene structures. In our LPD, the upper and lower levels represent scene layouts and shape details, respectively. We introduce a residual refinement module to progressively complement high-frequency details for signal prediction at each level. To recover the signals at each individual level in the pyramid, an ADFF module is proposed to adaptively fuse multiscale image features for accurate prediction. We conduct comprehensive experiments to evaluate a number of variants of our model on three important DCR tasks, i.e., monocular depth estimation, single-image height estimation, and density map estimation for crowd counting.
Experiments demonstrate that our method achieves new state-of-the-art performance in both qualitative and quantitative evaluation on NYU-D V2 and KITTI for monocular depth estimation, on the challenging Urban Semantic 3D (US3D) dataset for satellite height estimation, and on four challenging benchmarks for crowd counting. These results indicate that the proposed LAPNet is a universal and effective architecture for DCR problems.
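
To make the multiscale idea concrete, the following is a minimal NumPy sketch of a classical Laplacian pyramid, not the LAPNet decoder itself. It decomposes a dense 2D signal into per-level high-frequency residuals plus a coarse layout, then rebuilds it coarse to fine. The function names, the 2x2 average-pooling downsampler, and the nearest-neighbour upsampler are assumptions chosen for illustration; in the network described above, the levels would be predicted from image features rather than computed from a known signal.

```python
import numpy as np

def downsample(x):
    """Halve resolution with 2x2 average pooling (assumes even height and width)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Double resolution with nearest-neighbour repetition."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def build_laplacian_pyramid(signal, levels):
    """Return [residual_0, ..., residual_{levels-2}, coarse], finest level first."""
    pyramid = []
    current = signal
    for _ in range(levels - 1):
        coarse = downsample(current)
        # The residual keeps what the coarser level cannot represent (shape details).
        pyramid.append(current - upsample(coarse))
        current = coarse
    pyramid.append(current)  # coarsest level: global scene layout
    return pyramid

def reconstruct(pyramid):
    """Coarse-to-fine reconstruction: upsample, then add each level's residual."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = upsample(current) + residual
    return current

# Hypothetical 16x16 "depth map" used only to exercise the functions.
rng = np.random.default_rng(0)
depth = rng.random((16, 16))
pyr = build_laplacian_pyramid(depth, levels=3)
restored = reconstruct(pyr)
```

Because each residual is defined against the upsampled coarser level, the reconstruction recovers the input up to floating-point error, which is the property that makes a coarse-to-fine decoder with per-level refinement a natural fit for this representation.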

Dynamic Parallel Pyramid Networks for Scene Recognition

Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion

Dynamic Feature Pyramid Networks for Detection

DMPNet: Distributed Multi-Scale Pyramid Network for Real-Time Semantic Segmentation

DeepScene: Scene classification via convolutional neural network with spatial pyramid pooling

Attention Pyramid Module for Scene Recognition

Object Detection in Remote Sensing Images Based on a Scene-Contextual Feature Pyramid Network

EPRNet: Efficient Pyramid Representation Network for Real-Time Street Scene Segmentation

An End-to-End Trainable Multi-Column CNN for Scene Recognition in Extremely Changing Environment

DCPNet: A Densely Connected Pyramid Network for Monocular Depth Estimation

Laplacian Pyramid Neural Network for Dense Continuous-Value Regression for Complex Scenes

Self-Selection Salient Region-Based Scene Recognition Using Slight-Weight Convolutional Neural Network

SRNet: A 3D Scene Recognition Network Using Static Graph and Dense Semantic Fusion

Scene recognition using multiple representation network

MM-FPN: Multi-path and Multi-scale Feature Pyramid Network for Object Detection

Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Traffic Scenes

Knowledge Guided Disambiguation for Large-Scale Scene Classification with Multi-Resolution CNNs

DPNet: Dual-Path Network for Real-Time Object Detection With Lightweight Attention

Dual Attention Based Image Pyramid Network for Object Detection

Dual Complementary Dynamic Convolution for Image Recognition

Dense Feature Pyramid Deep Completion Network