Abstract: Existing self-supervised monocular depth estimation methods can get rid of expensive annotations and achieve promising results. However, these methods suffer from severe performance degradation when directly adopting a model trained on a fixed resolution to evaluate at other different resolutions. In this paper, we propose a resolution adaptive self-supervised monocular depth estimation method (RA-Depth) by learning the scale invariance of the scene depth. Specifically, we propose a simple yet efficient data augmentation method to generate images with arbitrary scales for the same scene. Then, we develop a dual high-resolution network that uses the multi-path encoder and decoder with dense interactions to aggregate multi-scale features for accurate depth inference. Finally, to explicitly learn the scale invariance of the scene depth, we formulate a cross-scale depth consistency loss on depth predictions with different scales. Extensive experiments on the KITTI, Make3D and NYU-V2 datasets demonstrate that RA-Depth not only achieves state-of-the-art performance, but also exhibits a good ability of resolution adaptation.

What problem does this paper attempt to address?

The problem this paper addresses is that, in self-supervised monocular depth estimation, existing methods degrade severely when a model trained at one fixed resolution is evaluated directly at other resolutions. The paper therefore proposes a resolution-adaptive self-supervised monocular depth estimation method (RA-Depth) that aims to improve the model's adaptability across resolutions.

### Main Contributions

1. **First to Address Image Resolution Adaptation**: As far as the authors know, RA-Depth is the first work to address the image resolution adaptation problem in self-supervised monocular depth estimation.
2. **Arbitrary-scale Data Augmentation Method**: An arbitrary-scale data augmentation method is proposed. By randomly resizing, cropping and splicing the image, it generates images of the same scene at different scales, prompting the model to implicitly learn the scale invariance of the scene depth.
3. **Efficient Dual High-Resolution Network**: An efficient dual high-resolution network (Dual HRNet) is developed. The network fuses multi-scale features, extracting and aggregating them thoroughly so that depth can be estimated more accurately.
4. **Cross-scale Depth Consistency Loss**: A new cross-scale depth consistency loss is proposed. It explicitly learns the scale invariance of the scene depth, enabling the model to predict consistent depth maps for input images of different scales.

### Method Overview

1. **Arbitrary-scale Data Augmentation**:
   - For the original image \(I\), three training images \(I_L\), \(I_M\) and \(I_H\) at different scales are generated by random scaling, cropping and splicing.
   - The specific steps are as follows:

   ```markdown
   1. Randomly initialize the scale factors \(s_L\) and \(s_H\), with the ranges of [0.7, 0.9] and [1.1, 2.0] respectively.
   2. Calculate the image sizes \((h_L, w_L)\), \((h_M, w_M)\) and \((h_H, w_H)\) at different scales.
   3. Generate the low-resolution image \(I_L\) by image splicing.
   4. Generate the medium-resolution image \(I_M\) by direct copying.
   5. Generate the high-resolution image \(I_H\) by random cropping.
   ```

2. **Dual High-Resolution Network (Dual HRNet)**:
   - Use HRNet18 as the encoder and inherit its advantage of multi-scale feature fusion.
   - Design an efficient decoder (HRDecoder) that gradually fuses low-scale features while maintaining a high-resolution feature representation.
   - The specific calculation formula for feature fusion is as follows:

   ```markdown
   \[
   \begin{aligned}
   d_1^i &= \text{CONV}_{3\times3}(e_i), & i &= 1, 2, 3, 4 \\
   d_m^{1i} &= d_1^i+\left[\text{CONV}_{1\times1}(\mu(d_1^k))\right], & i &= 1, 2, 3, k = i + 1,\ldots,4
   \end{aligned}
   \]
   ```

3. **Cross-scale Depth Consistency Loss**:
   - Calculate the cross-scale depth consistency losses \(L_{cs}^{LM}\) and \(L_{cs}^{MH}\), which constrain the model to explicitly learn the scale invariance of the scene depth.
   - The specific calculation formula is as follows:

   ```markdown
   \[ L_{cs}^{MH}=\alpha\left(\frac{1-\text{SSIM}(\tilde{D}_t^M,\tilde{D}_t^
   ```
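The five arbitrary-scale augmentation steps listed in the Method Overview can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. Several details are assumptions that the summary does not state: all three outputs share one training size `(h, w)`, the medium size equals `(h, w)`, "splicing" is taken to mean tiling the downscaled image until it fills the canvas, and resizing uses nearest-neighbour sampling. The function names (`resize_nearest`, `tile_to_size`, `random_crop`, `arbitrary_scale_augment`) are hypothetical.

```python
import random


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an image stored as a list of rows."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def tile_to_size(img, out_h, out_w):
    """Fill an (out_h, out_w) canvas by repeating img.

    Assumption: this is one possible reading of "image splicing".
    """
    in_h, in_w = len(img), len(img[0])
    return [[img[r % in_h][c % in_w] for c in range(out_w)] for r in range(out_h)]


def random_crop(img, out_h, out_w, rng):
    """Cut a random (out_h, out_w) window out of a larger image."""
    in_h, in_w = len(img), len(img[0])
    top = rng.randint(0, in_h - out_h)
    left = rng.randint(0, in_w - out_w)
    return [row[left:left + out_w] for row in img[top:top + out_h]]


def arbitrary_scale_augment(image, h, w, rng=None):
    """Return (I_L, I_M, I_H), each of size (h, w), for one source image."""
    if rng is None:
        rng = random.Random()

    # Step 1: sample the scale factors from the ranges given in the summary.
    s_low = rng.uniform(0.7, 0.9)
    s_high = rng.uniform(1.1, 2.0)

    # Step 2: compute the three image sizes (medium size assumed equal to (h, w)).
    h_low, w_low = int(h * s_low), int(w * s_low)
    h_mid, w_mid = h, w
    h_high, w_high = int(h * s_high), int(w * s_high)

    # Step 3: low resolution, shrink the scene and tile it to fill the canvas.
    img_low = tile_to_size(resize_nearest(image, h_low, w_low), h, w)

    # Step 4: medium resolution, the resized image is used as-is.
    img_mid = resize_nearest(image, h_mid, w_mid)

    # Step 5: high resolution, enlarge the scene and crop a random window.
    img_high = random_crop(resize_nearest(image, h_high, w_high), h, w, rng)

    return img_low, img_mid, img_high
```

Passing a seeded `random.Random` instance makes the sampled scales and crop offsets reproducible. Under these assumptions, the same scene content covers a different share of a fixed-size input in each of the three images, which is the property the augmentation is described as providing.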

RA-Depth: Resolution Adaptive Self-Supervised Monocular Depth Estimation
