Abstract: Depth completion aims to predict dense depth maps from the sparse depth measurements of a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite their excellent performance, they suffer from a limited representation area. To overcome the drawbacks of CNNs, a more effective and powerful method has been presented: the Transformer, a sequence-to-sequence model built on adaptive self-attention. However, the computational cost of the standard Transformer's key-query dot product grows quadratically with input resolution, which makes it poorly suited to depth completion tasks. In this work, we propose a different window-based Transformer architecture for depth completion, named Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth map and RGB image features, a U-shaped encoder-decoder Transformer for extracting deep features, and a refinement module. Specifically, we first concatenate the depth map features with the RGB image features through the input module. Then, instead of calculating self-attention over the whole feature maps, we apply different window sizes to extract the long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to produce the dense depth map. In practice, SDformer obtains state-of-the-art results against CNN-based depth completion models, with lower computational load and fewer parameters, on the NYU Depth V2 and KITTI DC datasets.
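To make the cost argument concrete, the following minimal sketch counts query-key pairs for global self-attention versus attention restricted to non-overlapping windows. It is an illustration under stated assumptions, not the SDformer implementation: the function names, the 64 x 64 feature map, and the window size of 8 are hypothetical and do not come from the paper, and the sketch assumes the windows tile the feature map exactly.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height, width, window):
    """Query-key pairs when attention stays inside non-overlapping
    window x window tiles (assumes the tiles cover the map exactly)."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


def window_partition(grid, window):
    """Split an H x W nested list into flattened window x window tiles,
    scanning tiles left to right, then top to bottom."""
    height, width = len(grid), len(grid[0])
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([grid[r][c]
                          for r in range(top, top + window)
                          for c in range(left, left + window)])
    return tiles


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    h, w, m = 64, 64, 8
    print(global_attention_pairs(h, w))     # 16777216
    print(window_attention_pairs(h, w, m))  # 262144
```

With these illustrative sizes, windowing reduces the number of pairs by a factor of 64. The global count grows quadratically with the number of tokens, while the windowed count grows linearly for a fixed window size.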

SDformer: Efficient End-to-End Transformer for Depth Completion
