Abstract: Monocular depth estimation is known as an ill-posed task in which objects in a 2D image usually do not contain sufficient information to predict their depth. Thus, it acts differently from other tasks (e.g., classification and segmentation) in many ways. In this paper, we find that self-supervised monocular depth estimation shows a direction sensitivity and environmental dependency in the feature representation. But the current backbones borrowed from other tasks pay less attention to handling different types of environmental information, limiting the overall depth accuracy. To bridge this gap, we propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth feature representation in two aspects. First, we propose a direction-aware module, which can learn to adjust the feature extraction in each direction, facilitating the encoding of different types of information. Secondly, we design a new cumulative convolution to improve the efficiency for aggregating important environmental information. Experiments show that our method achieves significant improvements on three widely used benchmarks, KITTI, Cityscapes, and Make3D, setting a new state-of-the-art performance on the popular benchmarks with all three types of self-supervision.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
This paper addresses direction sensitivity and environment dependency in self-supervised monocular depth estimation. Specifically, the authors find that the feature representation for this task is sensitive to direction and depends on environmental context, while current backbones, which are borrowed from other tasks such as classification and segmentation, pay little attention to handling different types of environmental information. This limits overall depth accuracy. To bridge the gap, the authors propose a new Direction-aware Cumulative Convolution Network (DaCCN) that improves the depth feature representation.
### Main Contributions
1. **Analysis of Direction Sensitivity and Environment Dependency**:
- The authors analyze direction sensitivity and environment dependency in self-supervised monocular depth estimation and, based on this analysis, propose the Direction-aware Cumulative Convolution Network (DaCCN) to better represent depth features.
2. **Direction-aware Module**:
- The authors discovered that features in different directions in an image play different roles in depth prediction and designed a learnable direction-aware module to adjust the sampling density and receptive field for each direction.
3. **Cumulative Convolution Operation**:
- The authors propose a new cumulative convolution operation to efficiently aggregate critical environmental information from the connected region (i.e., the area between the camera and the object).
4. **Experimental Validation**:
- Experiments show that the proposed method achieves significant improvements on three widely used benchmarks (KITTI, Cityscapes, and Make3D) and sets a new state of the art under all three types of self-supervision.
### Background and Motivation
Monocular depth estimation is an important computer vision task, especially in the field of autonomous driving. Unlike stereo matching methods, monocular depth estimation does not require rectified images, making it easier to apply to autonomous vehicles. However, since the pixel information of a single object is insufficient to predict its depth, monocular depth estimation is considered an ill-posed task. Therefore, the model heavily relies on the interrelationship between objects and the environment.
### Method Overview
1. **Direction-aware Module**:
- A learnable affine transformation is inserted at the beginning of each branch, mapping the input into a space suited to direction-aware information encoding.
- A back projection is added at the end of each branch to maintain consistency between the features and the input image.
- The direction-aware module can adjust the sampling density and receptive field based on the features in each direction, thereby extracting information more effectively.
2. **Cumulative Convolution**:
- The cumulative convolution operation is divided into three parts: spatial convolution, accumulator, and normalization.
- Spatial convolution extracts local spatial information and modulates features.
- The accumulator sums features from the bottom of the feature map up to the current pixel, which extends the receptive field to cover the entire connected region.
- The normalization operation normalizes the accumulated features based on their positions in the feature map to avoid pixel value imbalance.
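
The three parts of the cumulative convolution can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the learned spatial convolution is replaced by a fixed 3-tap horizontal mean, normalization is assumed to mean dividing by the number of rows accumulated so far, and all function names are hypothetical.

```python
def box_filter_rows(feat):
    """Stand-in for the learned spatial convolution: 3-tap horizontal mean.

    feat is a list of rows (top row first), each row a list of floats.
    Border pixels reuse their own value for the missing neighbour.
    """
    out = []
    for row in feat:
        width = len(row)
        new_row = []
        for col in range(width):
            left = row[max(col - 1, 0)]
            right = row[min(col + 1, width - 1)]
            new_row.append((left + row[col] + right) / 3.0)
        out.append(new_row)
    return out


def cumulative_accumulate(feat):
    """Accumulate each column from the bottom row up to the current pixel.

    Assumed normalization: divide the running sum by the number of rows
    it covers, so pixels near the top do not get larger magnitudes.
    """
    height = len(feat)
    width = len(feat[0])
    out = [[0.0] * width for _ in range(height)]
    running = [0.0] * width
    for row in range(height - 1, -1, -1):  # bottom row first
        count = height - row  # rows accumulated so far, including this one
        for col in range(width):
            running[col] += feat[row][col]
            out[row][col] = running[col] / count
    return out


def cumulative_convolution(feat):
    """Spatial filtering followed by accumulation and normalization."""
    return cumulative_accumulate(box_filter_rows(feat))


if __name__ == "__main__":
    feature_map = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for row in cumulative_accumulate(feature_map):
        print(row)
```

In this sketch the value at each pixel is the average of everything below it in the same column, so a pixel near the top of the map summarizes the whole column beneath it, while the bottom row is left unchanged.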
### Experimental Results
- **KITTI Dataset**:
- On the Eigen split, DaCCN achieves the best performance under all three types of self-supervision and at two different input resolutions.
- The improvements are most pronounced in squared relative error (SqRel) and root mean square error (RMSE), two metrics that weight large errors heavily, suggesting that the method corrects some of the larger mistakes made by the original models.
- **Make3D and Cityscapes Datasets**:
- DaCCN also performs strongly on these two datasets, supporting its robustness and effectiveness across different scenarios.
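
For reference, the two error metrics named above can be computed as in the following sketch, which uses their standard definitions in the depth estimation literature. The function names are illustrative, and the inputs are assumed to be flat lists of predicted and ground-truth depths restricted to pixels with valid, positive ground truth.

```python
import math


def sq_rel(pred, gt):
    """Squared relative error: mean of (pred - gt)^2 / gt."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean square error: sqrt of the mean of (pred - gt)^2."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


if __name__ == "__main__":
    predicted = [2.0, 4.0]
    ground_truth = [1.0, 2.0]
    print(sq_rel(predicted, ground_truth))
    print(rmse(predicted, ground_truth))
```

Because both metrics square the per-pixel error, a few badly wrong pixels move them far more than many slightly wrong ones.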
### Conclusion
By introducing the direction-aware module and cumulative convolution operation, this paper effectively addresses the issues of direction sensitivity and environment dependency in self-supervised monocular depth estimation, significantly improving the accuracy and robustness of depth estimation.