Abstract: Depth estimation using monocular vision sensors is crucial in computer vision, with applications ranging from autonomous driving to robot motion. Conventional methods suffer from a trade-off between consistency and fine-grained detail because their local receptive field limits their practicality. This lack of long-range dependency stems from the convolutional neural network (CNN) part of the architecture. In this article, a dual window transformer-based network, namely DwinFormer, is proposed, which utilizes both local and global features for depth estimation using monocular vision sensors. DwinFormer consists of two transformers built on dual window self-attention (Dwin-SA) and dual window cross-attention: the dual window self-attention transformer (Dwin-SAT) and the dual window cross-attention transformer (Dwin-CAT), respectively. The Dwin-SAT extracts locally aware features while concurrently capturing global context. It uses local and global window attention to capture both short-range and long-range dependencies, obviating the need for complex and computationally expensive operations such as attention masking or window shifting. Moreover, Dwin-SAT introduces inductive biases that provide desirable properties, such as translational equivariance and less dependence on large-scale data. Furthermore, conventional decoding methods often rely on skip connections, which may result in semantic discrepancies and a lack of global context when fusing encoder and decoder features. In contrast, the Dwin-CAT employs both local and global window cross-attention to fuse encoder and decoder features with both fine-grained local and contextually aware global information, effectively bridging the semantic gap. 
Empirical evidence obtained through extensive experimentation on the NYU-Depth-V2 and Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) datasets demonstrates the superiority of the proposed method, which consistently outperforms existing approaches across both indoor and outdoor environments.
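
To make the idea of pairing local and global window attention concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not say how the global windows are formed, so the strided grid partition in `global_windows` is an assumption chosen for illustration. The function names, the absence of learned query/key/value projections, and the toy feature-map sizes are likewise hypothetical. Passing the same windows as queries and keys/values gives a self-attention step in the spirit of Dwin-SAT; passing decoder windows as queries and encoder windows as keys/values gives a cross-attention step in the spirit of Dwin-CAT.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_windows(x, w):
    """Split an (H, W, C) map into non-overlapping w x w blocks.

    Returns (num_windows, w*w, C); each window covers a contiguous patch.
    """
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def global_windows(x, w):
    """Assumed strided grid partition (not confirmed by the abstract).

    Each window holds w*w tokens sampled with stride H//w and W//w, so a
    single window spans the whole feature map without shifting or masking.
    """
    H, W, C = x.shape
    x = x.reshape(w, H // w, w, W // w, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, w * w, C)

def window_attention(q_win, kv_win):
    """Scaled dot-product attention inside each window.

    Learned projections are omitted to keep the sketch short.
    """
    d = q_win.shape[-1]
    scores = q_win @ kv_win.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_win

# Toy usage with hypothetical 8 x 8 x 16 encoder and decoder feature maps.
rng = np.random.default_rng(0)
enc = rng.standard_normal((8, 8, 16))
dec = rng.standard_normal((8, 8, 16))

# Self-attention: queries, keys and values come from the same map.
self_local = window_attention(local_windows(enc, 4), local_windows(enc, 4))
self_global = window_attention(global_windows(enc, 4), global_windows(enc, 4))

# Cross-attention: decoder tokens query encoder tokens in matching windows.
cross_local = window_attention(local_windows(dec, 4), local_windows(enc, 4))
cross_global = window_attention(global_windows(dec, 4), global_windows(enc, 4))
```

In this sketch the two partitions cost the same per window, and the global branch reaches distant positions purely through how tokens are grouped, which is one way to avoid window shifting and attention masking.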

MDEConvFormer: estimating monocular depth as soft regression based on convolutional transformer

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Monocular Depth Estimation Based on Multi-Scale Graph Convolution Networks

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

EDFIDepth: enriched multi-path vision transformer feature interaction networks for monocular depth estimation

A Contour-Aware Monocular Depth Estimation Network using Swin Transformer and Cascaded Multi-scale Fusion

Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

Monocular Depth Estimation Algorithm Integrating Parallel Transformer and Multi-Scale Features

Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation

Depth Estimation from Monocular Images Using Dilated Convolution and Uncertainty Learning

Lightweight monocular depth estimation using a fusion-improved transformer

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Complete contextual information extraction for self-supervised monocular depth estimation

PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation

Monocular image depth estimation using dilated convolution and spatial pyramid polling structure

DwinFormer: Dual Window Transformers for End-to-End Monocular Depth Estimation

TAMDepth: self-supervised monocular depth estimation with transformer and adapter modulation

A Monocular Depth Estimation Method for Indoor-Outdoor Scenes Based on Vision Transformer

Towards Comprehensive Monocular Depth Estimation: Multiple Heads are Better Than One