Abstract: For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. Our first step consists of generic pretraining to learn 3D geometry, using the cross-view completion objective (CroCo), followed by self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles' using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of our proposed method by running evaluations on six benchmark datasets, both static and dynamic, indoor and outdoor, with synthetic and real images. For all datasets, our method outperforms state-of-the-art methods, in particular for the depth prediction task.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses the **joint estimation of monocular depth and visual odometry (VO)**. Specifically, the authors propose a self-supervised Transformer model that estimates scene depth and camera motion simultaneously, without labeled data.

#### Main problem background

1. **High cost of data labeling**: Traditional supervised learning methods require large amounts of labeled data for training, which is expensive and time-consuming to collect in practice.
2. **Complexity of geometric tasks**: Depth estimation and visual odometry are closely related tasks that both rely on understanding the geometric structure of the scene. Existing methods usually handle the two tasks separately, which limits performance.
3. **Complexity of existing model architectures**: Many advanced depth estimation and VO models rely on complex network architectures, such as ResNet and HRNet. Although these models perform well, they are difficult to unify and extend.

#### Solutions proposed in the paper

To address these problems, the authors propose the following:

1. **Two-stage self-supervised learning framework**:
   - **Pre-training stage**: Use the cross-view completion objective (CroCo) for generic pre-training to learn 3D geometric structure. CroCo is based on masked image modeling: it learns geometric features of the scene by reconstructing a partially masked image with the help of a second view of the same scene.
   - **Fine-tuning stage**: Perform self-supervised fine-tuning on unlabeled video data, using geometric consistency constraints to optimize the model.
2. **Simplified model architecture**:
   - Use a standard Transformer architecture (such as ViT) as the backbone network instead of a complex convolutional network. This simplifies the model design and also improves the extensibility and robustness of the model.
   - Introduce modules such as adapters and dense prediction Transformers (DPT) to further improve performance while maintaining computational efficiency.
3. **Multi-task learning**:
   - Handle depth estimation and visual odometry in the same model, with a shared encoder and separate decoder branches, thereby improving overall performance.

#### Experimental verification

The authors ran experiments on six benchmark datasets (covering indoor and outdoor, static and dynamic, real and synthetic scenes). The results show that the method outperforms existing state-of-the-art methods on multiple metrics, in particular for depth prediction.

Taken together, these choices address the key challenges in monocular depth estimation and visual odometry listed above and provide an efficient, well-performing solution.
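The summary does not spell out the geometric consistency constraint used during fine-tuning. A common formulation in self-supervised depth and odometry work is photometric view synthesis: predicted depth and relative camera pose are used to warp a neighbouring frame into the current one, and the pixel difference serves as the training signal. The sketch below assumes that formulation and is not taken from the paper. The function names, the intrinsics tuple `K = (fx, fy, cx, cy)`, and the use of grayscale images stored as nested lists are all illustrative assumptions.

```python
import math

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with the given depth to a 3D point in the target camera frame."""
    fx, fy, cx, cy = K
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

def transform(point, R, t):
    """Apply a rigid motion R @ point + t (R is a 3x3 nested list, t has length 3)."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))

def project(point, K):
    """Project a 3D point onto the image plane of a pinhole camera."""
    fx, fy, cx, cy = K
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def bilinear(img, u, v):
    """Sample a grayscale image at sub-pixel (u, v); return None outside the image."""
    h, w = len(img), len(img[0])
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return None
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    a, b = u - u0, v - v0
    top = (1 - a) * img[v0][u0] + a * img[v0][u1]
    bottom = (1 - a) * img[v1][u0] + a * img[v1][u1]
    return (1 - b) * top + b * bottom

def photometric_loss(target, source, depth, K, R, t):
    """Mean L1 error between the target image and the source image warped into it.

    depth is the (assumed) predicted depth of the target frame; (R, t) is the
    (assumed) predicted motion from the target camera to the source camera.
    Pixels that project outside the source image or behind the camera are skipped.
    """
    total, count = 0.0, 0
    for v in range(len(target)):
        for u in range(len(target[0])):
            p = transform(backproject(u, v, depth[v][u], K), R, t)
            if p[2] <= 0:
                continue
            sample = bilinear(source, *project(p, K))
            if sample is None:
                continue
            total += abs(target[v][u] - sample)
            count += 1
    return total / count if count else 0.0
```

In this toy setting the loss is zero when depth and pose are both consistent with the two images and grows when either is wrong, which is what lets unlabeled video supervise a depth branch and a pose branch at the same time. Practical systems typically add terms such as structural similarity, smoothness regularisation, and masking of moving objects, none of which are shown here.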

Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

Monocular Depth Estimation Based on Unsupervised Learning

Transformer-Based Self-Supervised Monocular Depth and Visual Odometry

Digging Into Self-Supervised Monocular Depth Estimation

Complete contextual information extraction for self-supervised monocular depth estimation

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose From Monocular Video

Self-Supervised Learning based Depth Estimation from Monocular Images

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks

Improving Monocular Visual Odometry Using Learned Depth

CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

Unsupervised Monocular Depth Learning in Dynamic Scenes

Unsupervised Deep Persistent Monocular Visual Odometry and Depth Estimation in Extreme Environments

RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes

Self-Supervised Geometry-Guided Initialization for Robust Monocular Visual Odometry

SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning

Self-Supervised Monocular Depth Estimation With Self-Perceptual Anomaly Handling