Abstract: Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformer to model long-distance dependencies and estimate depth accurately. However, Transformer treats 2D image features as 1D sequences. Positional encoding only partly mitigates the resulting loss of spatial information between different feature blocks, and the approach tends to overlook channel features, which limits the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network to get finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies without compromising the two-dimensional structure of features while maintaining feature channel adaptivity. In addition, we introduce an up-sampling module to accurately recover the fine details in the depth map. Our method achieves competitive results on the KITTI dataset.

What problem does this paper attempt to address?

The paper attempts to address the issues of detail and accuracy in self-supervised monocular depth estimation. Specifically, existing methods often overlook channel features when handling long-range dependencies, leading to limited depth estimation performance. Additionally, existing upsampling modules recover the fine details of the depth map poorly, often resulting in blurred edges. To address these issues, the paper proposes a self-supervised monocular depth estimation network based on Large Kernel Attention (LKA) and an upsampling module aimed at improving the accuracy and detail quality of depth estimation.

### Main Contributions:

1. **Depth Network Based on Large Kernel Attention**: By introducing the large kernel attention mechanism, the network can model long-range dependencies while maintaining the 2D structure of features and adapting to feature channels, thereby improving the accuracy of depth estimation.
2. **Upsampling Module**: An upsampling module is introduced that can more accurately recover details in the depth map, reduce edge blurring, and improve the accuracy of monocular depth estimation.
3. **Experimental Validation**: Extensive experiments show that the method achieves competitive results on the KITTI dataset, particularly excelling in metrics such as Absolute Relative Error (AbsRel), Squared Relative Error (SqRel), Root Mean Squared Error (RMSE), and Logarithmic Root Mean Squared Error (RMSElog).

### Method Overview:

- **Overall Architecture**: The method includes a depth network and a pose network. The depth network adopts an encoder-decoder architecture, where the encoder uses HRNet18 and the decoder is based on the large kernel attention mechanism. The pose network uses ResNet18 to generate a 6-degrees-of-freedom (6-DoF) relative pose.
- **Large Kernel Attention Mechanism**: By cascading depthwise separable convolutions and large kernel convolutions, LKA can capture long-range dependencies while maintaining the 2D structure of features and channel adaptability.
- **Upsampling Module**: By generating offsets and adding them to the original sampling grid, this module can more accurately recover feature details and reduce edge blurring.

### Experimental Results:

- **Quantitative Results**: Experimental results on the KITTI dataset show that the method outperforms existing methods on multiple evaluation metrics, particularly excelling in error metrics such as AbsRel, SqRel, RMSE, and RMSElog.
- **Qualitative Results**: Compared to baseline methods and other classical methods, the depth maps generated by this method are clearer at boundaries (such as traffic signs, pedestrians, and roadside trees), with higher quality and sharper depth edges.

In summary, the paper significantly improves the accuracy and detail quality of self-supervised monocular depth estimation by introducing the large kernel attention mechanism and the upsampling module.
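The four error metrics named above (AbsRel, SqRel, RMSE, RMSElog) are not defined in this summary. The sketch below uses the definitions commonly used for depth evaluation, which may differ in detail from the paper's own protocol: the function name `depth_error_metrics`, the flat-list inputs, and the rule of skipping pixels with non-positive ground truth are illustrative assumptions, and steps such as depth capping, cropping, or median scaling are not included.

```python
import math


def depth_error_metrics(pred, gt):
    """Compute AbsRel, SqRel, RMSE and RMSElog for predicted vs. ground-truth depth.

    pred, gt: flat sequences of depths in the same unit (for example metres).
    Pixels whose ground truth is <= 0 are treated as missing and skipped
    (an assumption of this sketch). Predictions must be positive, because
    RMSElog takes their logarithm.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    # Keep only pixels that have a valid ground-truth depth.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)

    # Absolute relative error: mean of |d - d*| / d*
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Squared relative error: mean of (d - d*)^2 / d*
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    # Root mean squared error in depth units
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Root mean squared error of log-depth
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )

    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse, "RMSElog": rmse_log}


# Toy example: the third pixel has no ground truth and is ignored.
metrics = depth_error_metrics(
    pred=[11.0, 18.0, 5.0, 40.0],
    gt=[10.0, 20.0, 0.0, 40.0],
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

All four are error metrics, so lower values are better. AbsRel and SqRel normalise by the true depth and therefore weight errors on nearby objects more heavily, while RMSE is dominated by large absolute errors that typically occur at long range.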

Self-supervised Monocular Depth Estimation with Large Kernel Attention

Monocular Depth Estimation Based on Unsupervised Learning

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Self-supervised Monocular Depth Estimation Based on Combining Convolution and Multilayer Perceptron

MonoBooster: Semi-Dense Skip Connection with Cross-Level Attention for Boosting Self-Supervised Monocular Depth Estimation

Self-supervised Monocular Image Depth Estimation Primed by Transformer and Multi-scale Attention Scheme

TinyDepth: Lightweight Self-Supervised Monocular Depth Estimation Based on Transformer

Self-supervised Monocular Depth Estimation Via Asymmetric Convolution Block

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Attention-Based Dense Decoding Network for Monocular Depth Estimation

Depth Monocular Estimation with Attention-based Encoder-Decoder Network from Single Image

Self-Supervised Monocular Depth Estimation with Multi-constraints

Self-supervised Monocular Depth Estimation with Self-Distillation and Dense Skip Connection

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Complete contextual information extraction for self-supervised monocular depth estimation

A Self-Supervised Monocular Depth Estimation Method Based on High Resolution Convolutional Neural Network

Self-Supervised Monocular Depth Estimation Based on Channel Attention

Self-Supervised Monocular Depth Estimation Based on High-Order Spatial Interactions

MDSNet: self-supervised monocular depth estimation for video sequences using self-attention and threshold mask

Self-supervised monocular depth estimation via joint attention and intelligent mask loss

Unsupervised Monocular Depth Estimation with Encoder-decoder Network