Abstract: Existing self-supervised monocular depth estimation methods can get rid of expensive annotations and achieve promising results. However, these methods suffer from severe performance degradation when directly adopting a model trained on a fixed resolution to evaluate at other different resolutions. In this paper, we propose a resolution adaptive self-supervised monocular depth estimation method (RA-Depth) by learning the scale invariance of the scene depth. Specifically, we propose a simple yet efficient data augmentation method to generate images with arbitrary scales for the same scene. Then, we develop a dual high-resolution network that uses the multi-path encoder and decoder with dense interactions to aggregate multi-scale features for accurate depth inference. Finally, to explicitly learn the scale invariance of the scene depth, we formulate a cross-scale depth consistency loss on depth predictions with different scales. Extensive experiments on the KITTI, Make3D and NYU-V2 datasets demonstrate that RA-Depth not only achieves state-of-the-art performance, but also exhibits a good ability of resolution adaptation.
What problem does this paper attempt to address?
This paper addresses the severe performance degradation of self-supervised monocular depth estimation methods at test resolutions that differ from the training resolution: a model trained at one fixed resolution loses significant accuracy when evaluated directly at another. To address this, the paper proposes a resolution-adaptive self-supervised monocular depth estimation method (RA-Depth) that improves the model's adaptability across resolutions.
### Main Contributions
1. **Solve the Image Resolution Adaptation Problem for the First Time**: As far as the authors know, RA-Depth is the first work to address image resolution adaptation in self-supervised monocular depth estimation.
2. **Arbitrary-scale Data Augmentation Method**: An arbitrary-scale data augmentation method is proposed. By randomly resizing, cropping, and splicing the input image, it generates images of the same scene at different scales, prompting the model to implicitly learn the scale invariance of the scene depth.
3. **Efficient Dual High-Resolution Network**: An efficient dual high-resolution network (Dual HRNet) is developed. Its multi-scale feature fusion lets it extract and aggregate features across scales for more accurate depth estimation.
4. **Cross-scale Depth Consistency Loss**: A new cross-scale depth consistency loss is proposed that explicitly learns the scale invariance of the scene depth, so that the model predicts consistent depth maps for input images of different scales.
### Method Overview
1. **Arbitrary-scale Data Augmentation**:
- For the original image \(I\), three training images \(I_L\), \(I_M\) and \(I_H\) at different scales are generated by random scaling, cropping and splicing.
- The specific steps are as follows:
```markdown
1. Randomly sample the scale factors \(s_L\) and \(s_H\) from the ranges [0.7, 0.9] and [1.1, 2.0] respectively.
2. Calculate the image sizes \((h_L, w_L)\), \((h_M, w_M)\) and \((h_H, w_H)\) at the three scales.
3. Generate the low-resolution image \(I_L\) by image splicing.
4. Generate the medium-resolution image \(I_M\) by direct copying.
5. Generate the high-resolution image \(I_H\) by random cropping.
```
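The augmentation steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input is assumed to already be at the training resolution, resizing uses nearest-neighbour sampling, and "splicing" is approximated by wrap-around tiling of the shrunken image, since the summary does not specify the exact splicing layout.

```python
import numpy as np


def resize_nearest(img, h, w):
    """Nearest-neighbour resize of an (H, W, C) array to (h, w, C)."""
    rows = np.arange(h) * img.shape[0] // h
    cols = np.arange(w) * img.shape[1] // w
    return img[rows][:, cols]


def arbitrary_scale_augment(img, rng):
    """Return (I_L, I_M, I_H): three views of one scene, all of the input size.

    Assumption: `img` is already at the training resolution (h, w).
    """
    h, w = img.shape[:2]

    # Step 1: sample the scale factors from the ranges given in the text.
    s_low = rng.uniform(0.7, 0.9)
    s_high = rng.uniform(1.1, 2.0)

    # Step 2: sizes of the rescaled copies; the medium size is (h, w) itself.
    h_l, w_l = int(h * s_low), int(w * s_low)
    h_h, w_h = int(h * s_high), int(w * s_high)

    # Step 3: low scale. Shrink the image, then splice copies to fill (h, w).
    # Assumption: wrap-around tiling stands in for the paper's splicing.
    small = resize_nearest(img, h_l, w_l)
    img_l = small[np.arange(h) % h_l][:, np.arange(w) % w_l]

    # Step 4: medium scale. A direct copy of the input.
    img_m = img.copy()

    # Step 5: high scale. Enlarge the image, then take a random (h, w) crop.
    large = resize_nearest(img, h_h, w_h)
    top = rng.integers(0, h_h - h + 1)
    left = rng.integers(0, w_h - w + 1)
    img_h = large[top:top + h, left:left + w]

    return img_l, img_m, img_h
```

All three outputs share the same pixel dimensions, so they can be batched together; what differs is how much of the scene each pixel covers.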
2. **Dual High-Resolution Network (Dual HRNet)**:
- Use HRNet18 as the encoder, inheriting its multi-scale feature fusion.
- Design an efficient decoder (HRDecoder) that gradually fuses low-scale features while maintaining a high-resolution feature representation.
- The specific calculation formula for feature fusion is as follows:
```markdown
\[
\begin{aligned}
d_1^i &= \text{CONV}_{3\times3}(e_i), & i &= 1, 2, 3, 4 \\
d_m^{1i} &= d_1^i+\left[\text{CONV}_{1\times1}(\mu(d_1^k))\right], & i &= 1, 2, 3, k = i + 1,\ldots,4
\end{aligned}
\]
```
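The fusion pattern in the formula above can be sketched in NumPy. This is a rough illustration under assumptions that the summary does not confirm: \(\mu\) is taken to be nearest-neighbour upsampling, the square bracket is taken to mean a sum over all \(k > i\), each level is assumed to have half the spatial resolution of the previous one, and the inputs are assumed to be the already-projected features \(d_1^i\) (the \(3\times3\) convolution is omitted). The function names and the `weights` dictionary are hypothetical.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def conv1x1(x, weight):
    """1x1 convolution: mixes channels per pixel; weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def fuse(features, weights):
    """Fuse each level with all lower-resolution levels.

    features: list of (C_i, H_i, W_i) arrays, resolution halving per level.
    weights:  dict mapping (i, k) to a (C_i, C_k) matrix for the path k -> i.
    """
    n = len(features)
    fused = []
    for i in range(n):
        out = features[i].copy()
        for k in range(i + 1, n):
            # Bring level k up to the resolution of level i, then match channels.
            up = upsample_nearest(features[k], 2 ** (k - i))
            out = out + conv1x1(up, weights[(i, k)])
        fused.append(out)
    return fused
```

The coarsest level has no lower-resolution neighbours, so it passes through unchanged, which matches the index range \(i = 1, 2, 3\) in the second line of the formula.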
3. **Cross-scale Depth Consistency Loss**:
- Calculate the cross-scale depth consistency losses \(L_{cs}^{LM}\) and \(L_{cs}^{MH}\), which constrain the model to explicitly learn the scale invariance of the scene depth.
- The specific calculation formula is as follows:
```markdown
\[
L_{cs}^{MH}=\alpha\left(\frac{1-\text{SSIM}(\tilde{D}_t^M,\tilde{D}_t^
```