Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

Boris Chidlovskii, Leonid Antsfeld
2024-06-17
Abstract: For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. Our first step consists of generic pretraining to learn 3D geometry, using the cross-view completion objective (CroCo), followed by self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles' using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of our proposed method by running evaluations on six benchmark datasets, both static and dynamic, indoor and outdoor, with synthetic and real images. For all datasets, our method outperforms state-of-the-art methods, in particular for the depth prediction task.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the **joint estimation of monocular depth and visual odometry (VO)**. Specifically, the authors propose a self-supervised, transformer-based model that estimates scene depth and camera motion simultaneously, without labeled data.

#### Main problem background

1. **High cost of data labeling**: Supervised methods require large amounts of labeled data, which is expensive and time-consuming to collect in practice.
2. **Complexity of geometric tasks**: Depth estimation and visual odometry are closely related tasks that both depend on understanding the geometric structure of the scene. Existing methods usually handle them separately, which limits performance.
3. **Complexity of existing model architectures**: Many advanced depth estimation and VO models rely on complex network architectures, such as ResNet and HRNet. These models perform well, but they are difficult to unify and extend.

#### Solutions proposed in the paper

To address these problems, the authors propose the following:

1. **Two-stage self-supervised learning framework**:
   - **Pre-training stage**: Use the cross-view completion objective (CroCo) for generic pre-training to learn 3D geometric structure. CroCo is a masked-image-modeling method that learns scene geometry by reconstructing a partially masked image with the help of a second view of the same scene.
   - **Fine-tuning stage**: Perform self-supervised fine-tuning on unlabeled video data, using geometric consistency constraints as the training signal.
2. **Simplified model architecture**:
   - Use a standard Transformer architecture (such as ViT) as the backbone instead of a complex convolutional network. This simplifies the model design and improves the extensibility and robustness of the model.
   - Introduce adapters and dense prediction transformers (DPT) to further improve performance while maintaining computational efficiency.
3. **Multi-task learning**:
   - Handle depth estimation and visual odometry in the same model, with a shared encoder and separate decoder branches, to improve overall performance.

#### Experimental verification

The authors ran experiments on six benchmark datasets (covering indoor and outdoor, static and dynamic, and real and synthetic scenes). The results show that the method outperforms existing state-of-the-art methods on multiple metrics, in particular for depth prediction.

Together, these choices address the key challenges in monocular depth estimation and visual odometry with a comparatively simple, well-performing model.
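
The fine-tuning stage described above depends on geometric consistency between video frames, but the summary does not spell out the exact constraint. A common choice in self-supervised depth and odometry work is view-synthesis reprojection: a pixel in the target frame is lifted to 3D with the predicted depth, moved by the predicted camera pose, and projected into a neighboring source frame, where its intensity should match. The sketch below illustrates that idea in plain Python. It is an assumption-laden illustration, not the paper's loss: all function names are hypothetical, the camera is an ideal pinhole, and sampling is nearest-neighbor rather than the differentiable bilinear sampling a real training loop would need.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), pinhole model


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with a given depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = k
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(p: Vec3, rot: Sequence[Sequence[float]], trans: Sequence[float]) -> Vec3:
    """Apply the rigid motion X' = R X + t (target camera to source camera)."""
    x, y, z = (sum(rot[i][j] * p[j] for j in range(3)) + trans[i] for i in range(3))
    return (x, y, z)


def project(p: Vec3, k: Intrinsics) -> Tuple[float, float]:
    """Project a 3D point onto the image plane (assumes positive depth)."""
    fx, fy, cx, cy = k
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def reproject_pixel(u, v, depth, rot, trans, k: Intrinsics) -> Tuple[float, float]:
    """Location in the source frame where target pixel (u, v) should appear."""
    return project(transform(backproject(u, v, depth, k), rot, trans), k)


def photometric_l1(
    target: List[List[float]],
    source: List[List[float]],
    depth: List[List[float]],
    rot: Sequence[Sequence[float]],
    trans: Sequence[float],
    k: Intrinsics,
) -> float:
    """Mean |target - warped source| over pixels that reproject inside the image."""
    height, width = len(target), len(target[0])
    total, count = 0.0, 0
    for v in range(height):
        for u in range(width):
            us, vs = reproject_pixel(u, v, depth[v][u], rot, trans, k)
            ui, vi = int(round(us)), int(round(vs))  # nearest-neighbor sampling
            if 0 <= ui < width and 0 <= vi < height:
                total += abs(target[v][u] - source[vi][ui])
                count += 1
    return total / count if count else 0.0


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    k = (2.0, 2.0, 0.0, 0.0)
    target = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
    # The source frame shows the same content shifted one pixel to the right.
    source = [[9.0, 0.0, 1.0, 2.0], [9.0, 4.0, 5.0, 6.0]]
    depth = [[2.0] * 4 for _ in range(2)]
    # With depth 2 and fx = 2, a translation of 1 along x shifts pixels by 1.
    print(photometric_l1(target, source, depth, IDENTITY, [1.0, 0.0, 0.0], k))
    print(photometric_l1(target, source, depth, IDENTITY, [0.0, 0.0, 0.0], k))
```

In this toy setup the loss is zero only when depth and pose jointly explain the shift between frames, which is why a single photometric signal can, in principle, supervise a depth branch and a pose branch at the same time.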