Abstract: Depth estimation using monocular vision sensors is crucial in computer vision, with applications ranging from autonomous driving to robot motion. Conventional methods suffer from a trade-off between global consistency and fine-grained detail because of their local receptive field, which limits their practicality. This lack of long-range dependency stems from the convolutional neural network (CNN) part of the architecture. In this article, a dual window transformer-based network, namely DwinFormer, is proposed, which utilizes both local and global features for depth estimation using monocular vision sensors. DwinFormer consists of dual window self-attention (Dwin-SA) and cross-attention transformers: the dual window self-attention transformer (Dwin-SAT) and the dual window cross-attention transformer (Dwin-CAT), respectively. The Dwin-SAT extracts intricate, locally aware features while concurrently capturing global context. It uses local and global window attention to capture both short-range and long-range dependencies, obviating the need for complex and computationally expensive operations such as attention masking or window shifting. Moreover, Dwin-SAT introduces inductive biases that provide desirable properties, such as translational equivariance and less dependence on large-scale data. Furthermore, conventional decoding methods often rely on skip connections, which may result in semantic discrepancies and a lack of global context when fusing encoder and decoder features. In contrast, the Dwin-CAT employs both local and global window cross-attention to fuse encoder and decoder features with both fine-grained local and contextually aware global information, effectively bridging the semantic gap. Empirical evidence obtained through extensive experimentation on the NYU-Depth-V2 and Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) datasets demonstrates the superiority of the proposed method, which consistently outperforms existing approaches across both indoor and outdoor environments.
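
As a rough illustration of the dual window idea described in the abstract, the following NumPy sketch contrasts a local window (a block of neighbouring pixels) with a global window (tokens sampled at a fixed stride across the whole feature map) and applies plain scaled dot-product attention inside each window. It is not the authors' implementation. The function names, the strided construction of global windows, and the omission of learned query/key/value projections, multiple heads, and positional terms are all assumptions made for brevity; the abstract does not specify how the global windows are formed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_windows(x, w):
    """Split an (H, W, C) map into non-overlapping w x w blocks of
    neighbouring pixels. Returns (num_windows, w*w, C)."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def global_windows(x, w):
    """Split an (H, W, C) map into windows whose w x w tokens are spread
    across the whole map at a fixed stride (an assumed construction).
    Returns (num_windows, w*w, C)."""
    H, W, C = x.shape
    x = x.reshape(w, H // w, w, W // w, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, w * w, C)

def window_attention(q_win, kv_win):
    """Scaled dot-product attention inside each window.
    Self-attention: pass the same windows twice.
    Cross-attention: queries from one feature map, keys/values from another.
    Learned projections are omitted in this sketch."""
    d = q_win.shape[-1]
    scores = q_win @ kv_win.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ kv_win

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enc = rng.normal(size=(8, 8, 16))   # hypothetical encoder features
    dec = rng.normal(size=(8, 8, 16))   # hypothetical decoder features

    # Self-attention over local and global windows of the encoder map.
    local_sa = window_attention(local_windows(enc, 4), local_windows(enc, 4))
    global_sa = window_attention(global_windows(enc, 4), global_windows(enc, 4))

    # Cross-attention: decoder queries attend to encoder keys/values.
    local_ca = window_attention(local_windows(dec, 4), local_windows(enc, 4))
    print(local_sa.shape, global_sa.shape, local_ca.shape)
```

Because both window types are fixed, non-overlapping partitions, neither requires an attention mask or a window shift; the global windows reach distant positions simply by how their tokens are sampled.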

Event-based Monocular Depth Estimation with Recurrent Transformers

Event-based Monocular Dense Depth Estimation with Recurrent Transformers

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Monocular Depth Estimation Based on Multi-Scale Graph Convolution Networks

Fusing Events and Frames with Coordinate Attention Gated Recurrent Unit for Monocular Depth Estimation

Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction

Learning Monocular Dense Depth from Events

Monocular Depth Estimation Algorithm Integrating Parallel Transformer and Multi-Scale Features

Self-supervised Event-based Monocular Depth Estimation using Cross-modal Consistency

Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

Lightweight Monocular Depth Estimation with an Edge Guided Network

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

A Novel Spike Transformer Network for Depth Estimation from Event Cameras via Cross-modality Knowledge Distillation

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation

Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks

A Contour-Aware Monocular Depth Estimation Network using Swin Transformer and Cascaded Multi-scale Fusion

DwinFormer: Dual Window Transformers for End-to-End Monocular Depth Estimation

Recurrent Vision Transformers for Object Detection with Event Cameras