Abstract: Self-supervised learning methods are increasingly important for monocular depth estimation because they do not require ground-truth depth during training. Although existing methods based on Convolutional Neural Networks (CNNs) have achieved great success in monocular depth estimation, the limited receptive field of CNNs is usually insufficient to model global information effectively, e.g., the relationship between foreground and background or the relationships among objects, which is crucial for accurately capturing scene structure. Recently, Transformer-based studies have attracted significant interest in computer vision. However, due to their lack of a spatial locality bias, Transformers may fail to model local information, e.g., fine-grained details within an image. To tackle these issues, we propose a novel self-supervised learning framework that combines the advantages of CNNs and Transformers to model the complete contextual information for high-quality monocular depth estimation. Specifically, the proposed method mainly consists of two branches: the Transformer branch captures global information, while the convolution branch preserves local information. We also design a rectangle convolution module with a pyramid structure to perceive semi-global information, e.g., thin objects, along the horizontal and vertical directions within an image. Moreover, we propose a shape refinement module that learns the affinity matrix between each pixel and its neighborhood to obtain an accurate geometric structure of scenes. Extensive experiments on the KITTI, Cityscapes, and Make3D datasets demonstrate that the proposed method achieves competitive results compared with state-of-the-art self-supervised monocular depth estimation methods and shows good cross-dataset generalization ability.
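The idea behind the shape refinement module, weighting each pixel's neighbors by an affinity before aggregating depth, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract states that the affinity matrix is learned, whereas the sketch below substitutes a hand-crafted Gaussian affinity computed from a guidance image. The names `refine_depth`, `guide`, and `sigma`, and the 3x3 neighborhood size, are assumptions introduced only for illustration.

```python
import math


def refine_depth(depth, guide, sigma=0.1):
    """One affinity-weighted refinement step over each pixel's 3x3 neighborhood.

    ASSUMPTION: the affinity is a fixed Gaussian of guidance differences.
    In the paper the affinity matrix is learned by the network instead.
    """
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            weighted_sum = 0.0
            weight_total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        # Affinity is high when the neighbor looks like the
                        # center pixel in the guidance image, low otherwise.
                        diff = guide[y][x] - guide[ny][nx]
                        a = math.exp(-(diff * diff) / (2.0 * sigma * sigma))
                        weighted_sum += a * depth[ny][nx]
                        weight_total += a
            # The center pixel always contributes affinity 1, so the
            # normalizer is never zero.
            out[y][x] = weighted_sum / weight_total
    return out


# Toy scene: a near region (guide 0, depth about 2) next to a far region
# (guide 1, depth 10), with one noisy depth value inside the near region.
guide = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
depth = [
    [2.0, 2.0, 10.0, 10.0],
    [2.0, 3.0, 10.0, 10.0],
    [2.0, 2.0, 10.0, 10.0],
    [2.0, 2.0, 10.0, 10.0],
]
refined = refine_depth(depth, guide)
```

In this toy case the noisy value is pulled toward its same-region neighbors, while the depth discontinuity between the two regions stays sharp because cross-region affinities are close to zero. A learned affinity plays the same role but is predicted from image features rather than fixed by a formula.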

Combing Transformer and CNN for Monocular Depth Estimation

Monocular Depth Estimation Based on Multi-Scale Graph Convolution Networks

Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Monocular Depth Estimation Algorithm Integrating Parallel Transformer and Multi-Scale Features

A Contour-Aware Monocular Depth Estimation Network using Swin Transformer and Cascaded Multi-scale Fusion

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Complete contextual information extraction for self-supervised monocular depth estimation

Swin-Depth: Using Transformers and Multi-Scale Fusion for Monocular-Based Depth Estimation

PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation

TinyDepth: Lightweight Self-Supervised Monocular Depth Estimation Based on Transformer

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Monocular image depth estimation using dilated convolution and spatial pyramid polling structure

Towards Comprehensive Monocular Depth Estimation: Multiple Heads are Better Than One

Super-Resolution for Monocular Depth Estimation with Multi-Scale Sub-Pixel Convolutions and a Smoothness Constraint.

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Lightweight monocular depth estimation using a fusion-improved transformer

Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation

Lightweight Monocular Depth Estimation with an Edge Guided Network

A transformer-CNN parallel network for image guided depth completion

Monocular Depth Estimation Based on Dilated Convolutions and Feature Fusion