Abstract: Depth estimation is a classic computer vision task that plays a crucial role in understanding 3D scene geometry. Recently, methods based on deep convolutional neural networks have achieved promising results in monocular depth estimation. In particular, frameworks that combine multi-scale features extracted by a dilated-convolution block (atrous spatial pyramid pooling, ASPP) have brought significant improvements in dense labeling tasks. However, discretized, predefined dilation rates cannot capture the continuous context information that varies across scenes, and they easily introduce grid artifacts into the estimated depth. In this paper, we propose an attention-based context aggregation network (ACAN) to tackle these difficulties. Built on the self-attention model, ACAN adaptively learns task-specific similarities between pixels to model context information. First, we recast monocular depth estimation as a dense-labeling multi-class classification problem and propose a soft ordinal inference that transforms the predicted probabilities into continuous depth values, which reduces the discretization error (about a 1% decrease in RMSE). Second, the proposed ACAN aggregates both image-level and pixel-level context information for depth estimation: the former expresses the statistical characteristics of the whole image, and the latter extracts long-range spatial dependencies for each pixel. Third, to further reduce the inconsistency between the RGB image and the depth map, we construct an attention loss that minimizes their information entropy. We evaluate on public monocular depth estimation benchmarks, including NYU Depth V2 and KITTI. The experiments demonstrate the effectiveness of the proposed ACAN, which achieves results competitive with the state of the art.
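
The abstract does not spell out how the soft ordinal inference turns class probabilities into depth, so the following is only a minimal NumPy sketch of the general idea, under stated assumptions: depth bins spaced uniformly in log-depth, and a probability-weighted mean taken in log-depth space. The function names (`bin_centers`, `hard_inference`, `soft_inference`) and the depth range are hypothetical and are not taken from the paper.

```python
import numpy as np


def bin_centers(d_min, d_max, num_bins):
    """Centers of num_bins depth intervals spaced uniformly in log-depth.

    The log-uniform spacing is an assumption made for this sketch.
    """
    edges = np.exp(np.linspace(np.log(d_min), np.log(d_max), num_bins + 1))
    # Geometric mean of each interval's two edges.
    return np.sqrt(edges[:-1] * edges[1:])


def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def hard_inference(probs, centers):
    """Pick the center of the most probable bin (output stays discretized)."""
    return centers[np.argmax(probs, axis=-1)]


def soft_inference(probs, centers):
    """Probability-weighted mean in log-depth, mapped back to depth.

    Unlike the argmax, the result can fall between bin centers, so the
    output is continuous. This is one plausible reading of "soft" inference,
    not necessarily the paper's exact formula.
    """
    return np.exp(probs @ np.log(centers))


# Usage: per-pixel class scores of shape (H, W, K) become an (H, W) depth map.
rng = np.random.default_rng(0)
centers = bin_centers(d_min=1.0, d_max=10.0, num_bins=8)
probs = softmax(rng.normal(size=(4, 6, 8)))
depth_hard = hard_inference(probs, centers)  # values restricted to the 8 centers
depth_soft = soft_inference(probs, centers)  # continuous values within the range
```

With one-hot probabilities the two functions agree; they differ only when probability mass is spread over neighbouring bins, which is where a hard argmax would incur discretization error.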

CI-Net: a joint depth estimation and semantic segmentation network using contextual information

ACNET: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Attention-based Multi-modal Fusion Network for Semantic Scene Completion

Up-to-Down Network: Fusing Multi-Scale Context for 3D Semantic Scene Completion

TCANet: three-stream coordinate attention network for RGB-D indoor semantic segmentation

HCNet: Hierarchical Context Network for Semantic Segmentation

Context-Aware Interaction Network for RGB-T Semantic Segmentation

Adaptive Context-Aware Multi-Modal Network for Depth Completion

Simultaneous Semantic Segmentation and Depth Completion with Constraint of Boundary

Edge-Enhanced GCIFFNet: A Multiclass Semantic Segmentation Network Based on Edge Enhancement and Multiscale Attention Mechanism

Category-Based Interactive Attention and Perception Fusion Network for Semantic Segmentation of Remote Sensing Images

SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular Images

THCANet: Two-layer hop cascaded asymptotic network for robot-driving road-scene semantic segmentation in RGB-D images

Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation

Attention-based Context Aggregation Network for Monocular Depth Estimation

CDMANet: central difference mutual attention network for RGB-D semantic segmentation

BMSeNet: Multiscale Context Pyramid Pooling and Spatial Detail Enhancement Network for Real-Time Semantic Segmentation

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

CCNet: Criss-Cross Attention for Semantic Segmentation