Self-supervised Monocular Depth Estimation with Large Kernel Attention

Xuezhi Xiang, Yao Wang, Lei Zhang, Denis Ombati, Himaloy Himu, Xiantong Zhen
2024-09-26
Abstract: Self-supervised monocular depth estimation has emerged as a promising approach since it does not rely on labeled training data. Most methods combine convolution and Transformer to model long-distance dependencies and estimate depth accurately. However, the Transformer treats 2D image features as 1D sequences; positional encoding only partly mitigates the resulting loss of spatial information between feature blocks, and the Transformer tends to overlook channel features, which limits the performance of depth estimation. In this paper, we propose a self-supervised monocular depth estimation network to get finer details. Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies without compromising the two-dimensional structure of features while maintaining feature channel adaptivity. In addition, we introduce an up-sampling module to accurately recover the fine details in the depth map. Our method achieves competitive results on the KITTI dataset.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the issues of detail and accuracy in self-supervised monocular depth estimation. Specifically, existing methods often overlook channel features when handling long-range dependencies, which limits depth estimation performance. Additionally, existing upsampling modules recover the fine details of the depth map poorly, often producing blurred edges. To address these issues, the paper proposes a self-supervised monocular depth estimation network that combines a decoder based on Large Kernel Attention (LKA) with an upsampling module, aiming to improve the accuracy and detail quality of depth estimation.

### Main Contributions:

1. **Depth Network Based on Large Kernel Attention**: By introducing the large kernel attention mechanism, the network can model long-range dependencies while maintaining the 2D structure of features and adapting to feature channels, thereby improving the accuracy of depth estimation.
2. **Upsampling Module**: An upsampling module is introduced that can more accurately recover details in the depth map, reduce edge blurring, and improve the accuracy of monocular depth estimation.
3. **Experimental Validation**: Extensive experiments show that the method achieves competitive results on the KITTI dataset, particularly excelling in metrics such as Absolute Relative Error (AbsRel), Squared Relative Error (SqRel), Root Mean Squared Error (RMSE), and Logarithmic Root Mean Squared Error (RMSElog).

### Method Overview:

- **Overall Architecture**: The method includes a depth network and a pose network. The depth network adopts an encoder-decoder architecture, where the encoder uses HRNet18 and the decoder is based on the large kernel attention mechanism. The pose network uses ResNet18 to predict the 6-degree-of-freedom (6-DoF) relative pose.
- **Large Kernel Attention Mechanism**: By cascading depthwise separable convolutions and large kernel convolutions, LKA can capture long-range dependencies while maintaining the 2D structure of features and channel adaptability.
- **Upsampling Module**: By generating offsets and adding them to the original sampling grid, this module can more accurately recover feature details and reduce edge blurring.

### Experimental Results:

- **Quantitative Results**: Experimental results on the KITTI dataset show that the method outperforms existing methods on multiple evaluation metrics, particularly excelling in error metrics such as AbsRel, SqRel, RMSE, and RMSElog.
- **Qualitative Results**: Compared to baseline methods and other classical methods, the depth maps generated by this method are clearer at boundaries (such as traffic signs, pedestrians, and roadside trees), with higher quality and sharper depth edges.

In summary, the paper significantly improves the accuracy and detail quality of self-supervised monocular depth estimation by introducing the large kernel attention mechanism and the upsampling module.
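
As a rough illustration of the large kernel attention described under Method Overview, the following NumPy sketch chains a depthwise convolution, a dilated depthwise convolution, and a 1x1 convolution, then uses the result to gate the input element-wise. The kernel sizes (5x5, then 7x7 with dilation 3) follow the configuration commonly associated with the Visual Attention Network design and are an assumption here; the summary does not state the sizes used in the paper's decoder. The function names and the random weights are hypothetical.

```python
import numpy as np


def depthwise_conv2d(x, kernels, dilation=1):
    """Depthwise 'same' cross-correlation with zero padding.

    x: (C, H, W) features; kernels: (C, k, k) with odd k, one filter per channel.
    """
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x, dtype=float)
    # Accumulate one shifted, weighted copy of the input per kernel tap.
    for i in range(k):
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += kernels[:, i, j][:, None, None] * xp[:, di:di + h, dj:dj + w]
    return out


def pointwise_conv(x, weight):
    """1x1 convolution that mixes channels. weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def large_kernel_attention(x, w_local, w_dilated, w_point):
    """Sketch of LKA: build an attention map, then gate the input with it.

    ASSUMPTION: 5x5 local and 7x7 dilation-3 depthwise kernels; the paper's
    actual decoder settings are not given in the summary.
    """
    attn = depthwise_conv2d(x, w_local)                    # local context
    attn = depthwise_conv2d(attn, w_dilated, dilation=3)   # long-range context
    attn = pointwise_conv(attn, w_point)                   # channel mixing
    return attn * x                                        # element-wise gating


# Usage with hypothetical random weights on an 8-channel 32x32 feature map.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32, 32))
out = large_kernel_attention(
    x,
    rng.standard_normal((8, 5, 5)),
    rng.standard_normal((8, 7, 7)),
    rng.standard_normal((8, 8)),
)
print(out.shape)  # (8, 32, 32)
```

Because the attention map keeps the same (C, H, W) shape as the input, the features are never flattened into a 1D sequence, and the 1x1 convolution lets the attention weights differ per channel.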
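
The upsampling idea under Method Overview, adding offsets to the original sampling grid, can be illustrated with a small NumPy sketch. In the paper the module generates the offsets itself; here they are simply passed in as an array. The function names, the 2x scale factor, the pixel-unit offset convention, and the border clamping are all assumptions made for illustration.

```python
import numpy as np


def bilinear_sample(feat, ys, xs):
    """Bilinearly sample feat (C, H, W) at float pixel coordinates.

    ys, xs: (H_out, W_out) arrays; coordinates are clamped to the border.
    """
    _, h, w = feat.shape
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = ys - y0
    wx = xs - x0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def offset_upsample(feat, offsets, scale=2):
    """Upsample feat by sampling at (regular grid + offsets).

    offsets: (2, H*scale, W*scale) holding (dy, dx) in input-pixel units.
    ASSUMPTION: a learned layer would predict these offsets from the features.
    """
    _, h, w = feat.shape
    # Regular grid: centre of each output pixel expressed in input coordinates.
    oy = (np.arange(h * scale) + 0.5) / scale - 0.5
    ox = (np.arange(w * scale) + 0.5) / scale - 0.5
    gy, gx = np.meshgrid(oy, ox, indexing="ij")
    return bilinear_sample(feat, gy + offsets[0], gx + offsets[1])


# Usage: zero offsets reduce to plain bilinear upsampling.
feat = np.tile(np.arange(4, dtype=float), (1, 3, 1))  # (1, 3, 4), value = x index
plain = offset_upsample(feat, np.zeros((2, 6, 8)))
print(plain.shape)  # (1, 6, 8)
```

With all offsets at zero the function behaves like ordinary bilinear interpolation; non-zero offsets let each output pixel pull its value from a shifted location, which is the mechanism the summary credits with sharper edges.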
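
The four error metrics named in the results (AbsRel, SqRel, RMSE, RMSElog) have standard definitions in the KITTI depth evaluation literature. The following plain-Python sketch computes them over paired lists of predicted and ground-truth depths. The function name and dictionary keys are hypothetical, and the depth capping and median scaling that self-supervised evaluations usually apply beforehand are omitted.

```python
import math


def depth_error_metrics(pred, gt):
    """Return AbsRel, SqRel, RMSE and RMSElog for paired positive depths.

    pred, gt: equal-length sequences of depths in the same unit (e.g. metres).
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(gt)
    pairs = list(zip(pred, gt))
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
    }


# Usage with made-up depths; lower is better for all four metrics.
metrics = depth_error_metrics([2.0, 4.0], [1.0, 4.0])
print(metrics["abs_rel"], metrics["sq_rel"])  # 0.5 0.5
```

AbsRel and SqRel divide by the ground-truth depth, so they weight errors on nearby objects more heavily, while RMSE is dominated by large absolute errors at long range.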