Abstract: Although considerable advancements have been attained in self-supervised depth estimation from monocular videos, most existing methods often treat all objects in a video as static entities, which however violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. In this paper, we propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. Depth and motion networks work collaboratively to faithfully model the geometry and dynamics of real-world scenes, which, in turn, benefits both depth and 3D motion estimation. Their predictions are further combined to synthesize a novel video frame for self-supervised training. As a core component of our framework, DO3D is a new motion disentanglement module that learns to predict camera ego-motion and instance-aware 3D object motion separately. To alleviate the difficulties in estimating non-rigid 3D object motions, they are decomposed to object-wise 6-DoF global transformations and a pixel-wise local 3D motion deformation field. Qualitative and quantitative experiments are conducted on three benchmark datasets, including KITTI, Cityscapes, and VKITTI2, where our model delivers superior performance in all evaluated settings. For the depth estimation task, our model outperforms all compared research works in the high-resolution setting, attaining an absolute relative depth error (abs rel) of 0.099 on the KITTI benchmark. Besides, our optical flow estimation results (an overall EPE of 7.09 on KITTI) also surpass state-of-the-art methods and largely improve the estimation of dynamic regions, demonstrating the effectiveness of our motion model. Our code will be available.

What problem does this paper attempt to address?

This paper addresses the joint estimation of 3D motion and depth from monocular videos. Existing self-supervised depth estimation methods usually treat all objects in a video as static entities, which violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. As a result, these methods tend to perform poorly on dynamic objects. To address this, the authors propose a self-supervised learning framework, DO3D (Decomposed Object-aware 3D Motion and Depth), with three main objectives:

1. **Jointly learning 3D motion and depth**: A new decomposed object-wise 3D motion estimation module (DO3D) predicts camera ego-motion and the 3D motion of objects separately. The depth estimation network and the motion estimation network work together to model the geometry and dynamics of real-world scenes more faithfully, and their predictions are combined to synthesize a novel video frame for self-supervised training.
2. **Decomposing 3D object motion**: To alleviate the difficulty of estimating non-rigid 3D object motion, the DO3D module decomposes object motion into an object-wise global 6-DoF (six degrees of freedom) transformation and a pixel-wise local 3D motion deformation field. This decomposition lets the model better handle complex non-rigid motions, such as those of pedestrians and cyclists.
3. **Improving performance in dynamic scenes**: With these changes, DO3D achieves better performance on datasets containing more dynamic objects, such as Cityscapes and VKITTI2. In the high-resolution setting, the model attains an absolute relative depth error (abs rel) of 0.099 on the KITTI benchmark, outperforming all compared methods, and an overall optical flow EPE of 7.09 on KITTI.
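### Formula Representation

To make the working principle of DO3D clearer, here are some key formulas from the paper:

- **Back-projection formula**:

  \[ d_s p_s = K T_{t \to s} (d_t K^{-1} p_t) \]

  where \(p_t\) and \(p_s\) are the 2D homogeneous pixel coordinates in the target frame and the source frame respectively, \(d_t\) and \(d_s\) are the corresponding depth values, \(T_{t \to s}\) is the camera extrinsic matrix describing the relative pose from the target frame to the source frame, and \(K\) is the camera intrinsic matrix. The target pixel is back-projected to 3D using its depth, transformed by \(T_{t \to s}\), and projected into the source frame.

- **Photometric loss function**:

  \[ L_{\text{ph}}(\hat{I}_t, I_t) = \alpha \frac{1 - \text{SSIM}(\hat{I}_t, I_t)}{2} + (1 - \alpha) \| \hat{I}_t - I_t \|_1 \]

  where \(\hat{I}_t\) is the synthesized target frame, \(I_t\) is the observed target frame, and \(\alpha\) is a hyperparameter that balances the SSIM term against the pixel-level difference.

- **Mathematical relationship between depth and pixel coordinates**:

  \[ u_s = \frac{d_t (u_t - c_x) + f_x t_1}{d_t + t_3} + c_x \]

  \[ v_s = \frac{d_t (v_t - c_y) + f_y t_2}{d_t + t_3} + c_y \]

  where \((u_t, v_t)\) and \((u_s, v_s)\) are the pixel coordinates in the target and source frames, \(f_x, f_y\) and \(c_x, c_y\) are the focal lengths and principal point from \(K\), and \(t_1, t_2, t_3\) are the translation components of \(T_{t \to s}\). This form follows from the back-projection formula when the rotation part of \(T_{t \to s}\) is the identity, that is, under pure translation.

Together, these components allow DO3D to improve the accuracy of depth and motion estimation and to perform well in highly dynamic scenes.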
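The formulas above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera without skew, computes SSIM over a single patch rather than over local windows, and uses `alpha=0.85` as a commonly used default that the summary does not specify. All function names and sample values are hypothetical.

```python
def reproject(u_t, v_t, d_t, intrinsics, R, t):
    """Map a target pixel with depth d_t into the source view.

    Follows d_s * p_s = K * T_{t->s} * (d_t * K^{-1} * p_t).
    `intrinsics` is (fx, fy, cx, cy); R is 3x3, t has 3 components.
    Returns (u_s, v_s, d_s).
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3D point in the target camera frame.
    p = (d_t * (u_t - cx) / fx, d_t * (v_t - cy) / fy, d_t)
    # Apply the rigid transform T_{t->s} = [R | t].
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    # Project into the source image; the z component is the source depth d_s.
    return fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2]


def reproject_translation_only(u_t, v_t, d_t, intrinsics, t):
    """Closed-form pixel mapping when the rotation is the identity."""
    fx, fy, cx, cy = intrinsics
    u_s = (d_t * (u_t - cx) + fx * t[0]) / (d_t + t[2]) + cx
    v_s = (d_t * (v_t - cy) + fy * t[1]) / (d_t + t[2]) + cy
    return u_s, v_s


def photometric_loss(pred, target, alpha=0.85):
    """alpha * (1 - SSIM) / 2 + (1 - alpha) * mean |pred - target|.

    `pred` and `target` are flat lists of intensities in [0, 1] for one
    patch. Assumption: SSIM is computed once over the whole patch.
    """
    n = len(pred)
    mu_p, mu_t = sum(pred) / n, sum(target) / n
    var_p = sum((p - mu_p) * (p - mu_p) for p in pred) / n
    var_t = sum((q - mu_t) * (q - mu_t) for q in target) / n
    cov = sum((p - mu_p) * (q - mu_t) for p, q in zip(pred, target)) / n
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # usual SSIM stabilizing constants
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p * mu_p + mu_t * mu_t + c1) * (var_p + var_t + c2)
    )
    l1 = sum(abs(p - q) for p, q in zip(pred, target)) / n
    return alpha * (1 - ssim) / 2 + (1 - alpha) * l1


# Hypothetical camera and motion: identity rotation, small translation.
K = (720.0, 720.0, 620.0, 187.0)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.5, -0.1, 1.0)
print(reproject(700.0, 200.0, 10.0, K, I3, t))
print(reproject_translation_only(700.0, 200.0, 10.0, K, t))
```

Under an identity rotation the general reprojection and the closed-form expressions give the same pixel coordinates, which is a quick sanity check on the third pair of formulas. The loss is zero for identical patches and grows as the synthesized patch departs from the observed one.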

DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos
