NeRFmentation: NeRF-based Augmentation for Monocular Depth Estimation

Casimir Feldmann, Niall Siegenheim, Nikolas Hars, Lovro Rabuzin, Mert Ertugrul, Luca Wolfart, Marc Pollefeys, Zuria Bauer, Martin R. Oswald
2024-09-16
Abstract: The capabilities of monocular depth estimation (MDE) models are limited by the availability of sufficient and diverse datasets. In the case of MDE models for autonomous driving, this issue is exacerbated by the linearity of the captured data trajectories. We propose a NeRF-based data augmentation pipeline to introduce synthetic data with more diverse viewing directions into training datasets and demonstrate the benefits of our approach to model performance and robustness. Our data augmentation pipeline, which we call *NeRFmentation*, trains NeRFs on each scene in a dataset, filters out subpar NeRFs based on relevant metrics, and uses them to generate synthetic RGB-D images captured from new viewing directions. In this work, we apply our technique in conjunction with three state-of-the-art MDE architectures on the popular autonomous driving dataset, KITTI, augmenting the training set of its Eigen split. We evaluate the resulting performance gain on the original test set, a separate popular driving dataset, and our own synthetic test set.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses a performance limit of Monocular Depth Estimation (MDE) models: training datasets lack diversity and cover only limited viewpoint variation. Specifically, the authors focus on:

1. **Dataset Limitations**: MDE models rely on large-scale training datasets, but these datasets often lack diversity and viewpoint variation. This is especially true in autonomous driving. There, the recorded trajectories are usually linear, which limits how well the models generalize.
2. **Improving Generalization**: To help MDE models generalize to unseen environments, the authors propose NeRFmentation, a data augmentation method based on Neural Radiance Fields (NeRFs). It augments existing training datasets with synthetic RGB-D images rendered from new viewpoints, improving model robustness and generalization.

### Main Contributions

1. **NeRFmentation Data Augmentation Pipeline**: A simple yet effective pipeline that uses NeRFs to generate high-quality RGB-D images from new viewpoints. These images extend and diversify existing real-world depth estimation datasets, making the trained models more robust.
2. **Extensive Experimental Validation**: Several state-of-the-art MDE models trained on KITTI were evaluated zero-shot across datasets, notably on the Waymo Open Dataset, demonstrating the effectiveness of NeRFmentation.
3. **Comparison with Other Data Augmentation Techniques**: NeRFmentation outperforms other data augmentation techniques on both in-distribution and out-of-distribution test sets. The paper also offers insights into when to combine classical training-time augmentations (such as random rotation, flipping, and cropping) with dataset-level augmentation such as NeRFmentation.

### Method Overview

1. **Dataset Structure Analysis**: Many real-world image datasets are sequential, capturing the same environment from multiple slightly different viewpoints. Using the provided pose information, a 3D model of the scene can be reconstructed. That model can then render new RGB-D pairs, increasing the size of the original dataset.
2. **NeRF Training and Filtering**: Each scene is divided into sub-scenes, and a separate depth-supervised NeRF is trained on each one. A small portion of the data is held out as a validation set. After training, NeRFs that perform poorly on it are filtered out.
3. **New Viewpoint Synthesis**: New viewpoints are generated by applying small 3D rigid transformations to the original camera poses. The trained NeRFs then render dense RGB-D images from these viewpoints. These images are combined with the source dataset to form an augmented dataset for training downstream MDE networks.

### Experimental Results

1. **Datasets**: The experiments primarily use KITTI and the Waymo Open Dataset.
2. **Data Augmentation Baselines**: NeRFmentation is compared against methods such as Fourier Domain Adaptation (FDA) and style transfer.
3. **New Viewpoint Synthesis Strategies**: Five augmentation strategies are proposed, including re-rendering the training poses, interpolating new poses, and applying horizontal and vertical translations. The impact of depth completion and pose diversification is also analyzed.
4. **Model Training and Evaluation**: Three MDE architectures (AdaBins, DepthFormer, and BinsFormer) are trained on the augmented KITTI dataset and evaluated on the KITTI and Waymo test sets. The results show that NeRFmentation markedly improves model performance and generalization.

### Conclusion

NeRFmentation improves the robustness and generalization of MDE models by generating synthetic training data from new viewpoints, particularly in unseen environments.
The method performs well on in-distribution test data and significantly outperforms other data augmentation techniques on out-of-distribution test data.
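The NeRF filtering step (discarding sub-scene NeRFs that perform poorly on held-out frames) can be sketched as follows. This summary does not state which metrics or thresholds the authors use. This minimal NumPy sketch therefore *assumes* RGB PSNR and depth AbsRel computed on held-out validation frames, with illustrative thresholds. `ValidationResult`, `filter_nerfs`, and the threshold values are hypothetical.

```python
import numpy as np
from dataclasses import dataclass


@dataclass
class ValidationResult:
    """Hypothetical container for one NeRF's renders of its held-out frames."""
    name: str
    rgb_pred: np.ndarray    # (N, H, W, 3), values in [0, 1]
    rgb_gt: np.ndarray      # (N, H, W, 3), values in [0, 1]
    depth_pred: np.ndarray  # (N, H, W), metres
    depth_gt: np.ndarray    # (N, H, W), metres; 0 where no ground truth (sparse LiDAR)


def psnr(pred: np.ndarray, gt: np.ndarray) -> float:
    """Peak signal-to-noise ratio for images with maximum value 1.0."""
    mse = np.mean((pred - gt) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(1.0 / mse))


def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute relative depth error over pixels with ground truth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def filter_nerfs(results, min_psnr: float = 20.0, max_abs_rel: float = 0.1):
    """Return names of NeRFs passing both assumed quality thresholds."""
    kept = []
    for r in results:
        rgb_ok = psnr(r.rgb_pred, r.rgb_gt) >= min_psnr
        depth_ok = abs_rel(r.depth_pred, r.depth_gt) <= max_abs_rel
        if rgb_ok and depth_ok:
            kept.append(r.name)
    return kept
```

The sketch checks both an RGB and a depth criterion because the augmented data is used as RGB-D pairs. A NeRF with good color but wrong geometry would still produce misleading depth labels.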
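The New Viewpoint Synthesis step applies small 3D rigid transformations to the original camera poses. It might look like the minimal NumPy sketch below. The sketch assumes 4×4 camera-to-world matrices in the OpenCV/KITTI camera convention (x right, y down, z forward). `perturb_pose`, `rotation_y`, and the offset parameters are hypothetical illustrations, not the authors' code.

```python
import numpy as np


def rotation_y(angle_rad: float) -> np.ndarray:
    """3x3 rotation about the camera's y axis (a yaw when y points down)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def perturb_pose(cam_to_world: np.ndarray,
                 dx: float = 0.0, dy: float = 0.0,
                 yaw_rad: float = 0.0) -> np.ndarray:
    """Apply a small rigid transform expressed in the camera's own frame.

    Hypothetical helper: dx shifts the camera sideways (+x is right),
    dy shifts it vertically (+y is down in the OpenCV convention),
    and yaw_rad rotates it about its own y axis.
    """
    delta = np.eye(4)
    delta[:3, :3] = rotation_y(yaw_rad)
    delta[:3, 3] = [dx, dy, 0.0]
    # Right-multiplication applies the offset in camera coordinates.
    return cam_to_world @ delta


# Usage: derive new viewpoints from one (here: identity) original pose.
pose = np.eye(4)
shifted = perturb_pose(pose, dx=0.5)             # 0.5 m to the right
raised = perturb_pose(pose, dy=-0.3)             # 0.3 m up (y points down)
turned = perturb_pose(pose, yaw_rad=np.deg2rad(5.0))
```

Right-multiplying by the offset applies it in the camera's own frame. A horizontal shift therefore moves the camera sideways relative to its viewing direction rather than along a fixed world axis. Small offsets are a sensible default because NeRF renderings tend to degrade as viewpoints move away from the training trajectory.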
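The pose-interpolation strategy listed under Experimental Results needs a way to blend two poses. One plausible approach, offered here as an assumption since the summary does not specify the authors' scheme, interpolates translation linearly and rotation along the shortest arc, which is equivalent to quaternion SLERP. `interpolate_pose` and `axis_angle_to_matrix` are hypothetical helpers operating on 4×4 camera-to-world matrices.

```python
import numpy as np


def axis_angle_to_matrix(axis: np.ndarray, angle: float) -> np.ndarray:
    """Rodrigues' formula: rotation of `angle` radians about unit vector `axis`."""
    x, y, z = axis
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def interpolate_pose(pose_a: np.ndarray, pose_b: np.ndarray, t: float) -> np.ndarray:
    """Interpolate between two camera-to-world poses, t in [0, 1].

    Translation is interpolated linearly; rotation follows the geodesic
    between the two orientations. Assumes the relative rotation is well
    below 180 degrees, as for consecutive frames of a driving sequence.
    """
    R_a, R_b = pose_a[:3, :3], pose_b[:3, :3]
    R_rel = R_a.T @ R_b  # rotation taking orientation a to orientation b
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    out = np.eye(4)
    if angle < 1e-8:
        out[:3, :3] = R_a  # orientations (almost) identical
    else:
        # Unit rotation axis from the skew-symmetric part of R_rel.
        axis = np.array([R_rel[2, 1] - R_rel[1, 2],
                         R_rel[0, 2] - R_rel[2, 0],
                         R_rel[1, 0] - R_rel[0, 1]]) / (2.0 * np.sin(angle))
        out[:3, :3] = R_a @ axis_angle_to_matrix(axis, t * angle)
    out[:3, 3] = (1.0 - t) * pose_a[:3, 3] + t * pose_b[:3, 3]
    return out


# Usage: a virtual frame halfway between two (synthetic) consecutive poses.
frame_i = np.eye(4)
frame_j = np.eye(4)
frame_j[:3, :3] = axis_angle_to_matrix(np.array([0.0, 0.0, 1.0]), 0.4)
frame_j[:3, 3] = [2.0, 0.0, 0.0]
halfway = interpolate_pose(frame_i, frame_j, 0.5)
```

Interpolated poses stay on the recorded trajectory. They mainly densify the sampling of existing viewpoints, whereas rigid offsets move the camera off the trajectory.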