Abstract: The raw depth image captured by the indoor depth sensor usually has an extensive range of missing depth values due to inherent limitations such as the inability to perceive transparent objects and limited distance range. The incomplete depth map burdens many downstream vision tasks, and a rising number of depth completion methods have been proposed to alleviate this issue. While most existing methods can generate accurate dense depth maps from sparse and uniformly sampled depth maps, they are not suitable for complementing the large contiguous regions of missing depth values, which is common and critical. In this paper, we design a novel two-branch end-to-end fusion network, which takes a pair of RGB and incomplete depth images as input to predict a dense and completed depth map. The first branch employs an encoder-decoder structure to regress the local dense depth values from the raw depth map, with the help of local guidance information extracted from the RGB image. In the other branch, we propose an RGB-depth fusion GAN to transfer the RGB image to the fine-grained textured depth map. We adopt adaptive fusion modules named W-AdaIN to propagate the features across the two branches, and we append a confidence fusion head to fuse the two outputs of the branches for the final depth map. Extensive experiments on NYU-Depth V2 and SUN RGB-D demonstrate that our proposed method clearly improves the depth completion performance, especially in a more realistic setting of indoor environments with the help of the pseudo depth map.

What problem does this paper attempt to address?

The problem this paper addresses is **depth map completion in indoor environments**. Indoor depth sensors (such as Kinect and RealSense) cannot reliably perceive transparent objects, reflective surfaces, or objects that are too far away or too close, so the raw depth images they capture contain many invalid or missing depth values. Such incomplete depth maps degrade downstream vision tasks (such as 3D reconstruction and indoor navigation). The paper therefore proposes a new RGB-Depth fusion generative adversarial network (GAN) that predicts dense, complete depth maps from RGB images and incomplete depth maps.

### Main contributions:

1. **Propose a novel end-to-end GAN architecture** that effectively fuses the raw depth map and the RGB image to generate plausible dense depth maps.
2. **Design a pseudo-depth map generation technique** that simulates the distribution of missing depth in real scenes, which significantly improves the model's depth completion performance in indoor environments.
3. **Achieve state-of-the-art performance on the NYU-Depth V2 and SUN RGB-D datasets**, demonstrating the effectiveness of the method, especially in improving the performance of downstream tasks (such as object detection).

### Solution overview:

- **Dual-branch structure**: One branch uses an encoder-decoder structure to regress local dense depth values from the raw depth map, with the help of local guidance information extracted from the RGB image; the other branch introduces an RGB-depth fusion GAN to convert the RGB image into a depth map with fine-grained texture.
- **Feature fusion module**: The local guidance module and the W-AdaIN module share feature information between the branches at different stages to strengthen the fusion of RGB and depth.
- **Confidence fusion head**: Combines the outputs of the two branches to produce the final depth prediction.
- **Pseudo-depth map training strategy**: Generates more realistic depth-missing patterns from RGB images and semantic labels to improve the model's adaptability to real indoor scenes.

### Formula representation:

- **Loss function of RDF-GAN**:

\[ L_D=\mathbb{E}_{d_{raw}\sim D_{raw}}[D(G(M(d_{raw}))|r)]-\mathbb{E}_{d_{gt}\sim D_{gt}}[D(d_{gt}|r)] \]

\[ L_G=\lambda_g\mathcal{L}_1(G(M(d_{raw})))-\mathbb{E}_{d_{raw}\sim D_{raw}}[D(G(M(d_{raw}))|r)] \]

- **Final depth prediction formula**:

\[ d_{pred}(i,j)=\frac{e^{c_l(i,j)}\cdot d_l(i,j)+e^{c_f(i,j)}\cdot d_f(i,j)}{e^{c_l(i,j)}+e^{c_f(i,j)}} \]

Together, these components address indoor depth map completion, and the method performs particularly well on large missing regions.
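The final depth prediction formula and the two loss terms above can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python reading of the formulas, assuming that `d_l`, `c_l` are the depth and confidence maps of the local (encoder-decoder) branch and `d_f`, `c_f` those of the fusion (GAN) branch. The function names, the plain lists of critic scores, and the value passed for `lambda_g` are all hypothetical and chosen only for illustration.

```python
import math
from statistics import fmean


def fuse_pixel(d_l, c_l, d_f, c_f):
    """Softmax-weighted average of two depth estimates for one pixel.

    Assumed meaning: d_l, c_l come from the local branch and
    d_f, c_f from the fusion branch.
    """
    # Subtracting the larger confidence leaves the ratio unchanged
    # but keeps exp() from overflowing for large confidences.
    m = max(c_l, c_f)
    w_l = math.exp(c_l - m)
    w_f = math.exp(c_f - m)
    return (w_l * d_l + w_f * d_f) / (w_l + w_f)


def fuse_maps(d_l, c_l, d_f, c_f):
    """Apply fuse_pixel at every (i, j) of four equally sized 2D lists."""
    return [
        [fuse_pixel(*pixel) for pixel in zip(*rows)]
        for rows in zip(d_l, c_l, d_f, c_f)
    ]


def critic_loss(scores_fake, scores_real):
    """L_D: mean critic score on completed maps minus mean score on ground truth."""
    return fmean(scores_fake) - fmean(scores_real)


def generator_loss(l1_error, scores_fake, lambda_g):
    """L_G: weighted L1 term minus the mean critic score on completed maps."""
    return lambda_g * l1_error - fmean(scores_fake)


# Hypothetical 1x2 example: equal confidences at the first pixel,
# a strongly preferred local branch at the second.
d_pred = fuse_maps(
    d_l=[[1.0, 2.0]], c_l=[[0.0, 5.0]],
    d_f=[[3.0, 4.0]], c_f=[[0.0, -5.0]],
)
```

With equal confidences the fused depth is the plain average of the two branches; as one confidence grows relative to the other, the output moves toward that branch's depth. The loss helpers only combine numbers that a trained critic and generator would supply, so they show the sign conventions of the formulas rather than a training loop.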

RGB-Depth Fusion GAN for Indoor Depth Completion

RDFC-GAN: RGB-Depth Fusion CycleGAN for Indoor Depth Completion

MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Least Square Estimation Network for Depth Completion

Depth Completion via Inductive Fusion of Planar LIDAR and Monocular Camera

AGG-Net: Attention Guided Gated-convolutional Network for Depth Image Completion

Mask-adaptive Gated Convolution and Bi-directional Progressive Fusion Network for Depth Completion

An Adaptive Fusion Algorithm for Depth Completion

Depth Cue Enhancement and Guidance Network for RGB-D Salient Object Detection

A Transformer-Based Image-Guided Depth-Completion Model with Dual-Attention Fusion Module

Agspn: Efficient Attention-Gated Spatial Propagation Network for Depth Completion

RGB-Fusion: Monocular 3D reconstruction with learned depth prediction

An Efficient Information-Reinforced Lidar Deep Completion Network without RGB Guided

Learning Guided Convolutional Network for Depth Completion

A Concise but High-performing Network for Image Guided Depth Completion in Autonomous Driving

RGB×D: Learning Depth-Weighted RGB Patches for RGB-D Indoor Semantic Segmentation

Learning an Efficient Multimodal Depth Completion Model

A Multi-Cue Guidance Network for Depth Completion

Multiscale Adaptation Fusion Networks for Depth Completion

Depth Images Could Tell Us More: Enhancing Depth Discriminability for RGB-D Scene Recognition