DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos

Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi
2024-03-09
Abstract: Although considerable advancements have been attained in self-supervised depth estimation from monocular videos, most existing methods treat all objects in a video as static entities, which violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. In this paper, we propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. Depth and motion networks work collaboratively to faithfully model the geometry and dynamics of real-world scenes, which, in turn, benefits both depth and 3D motion estimation. Their predictions are further combined to synthesize a novel video frame for self-supervised training. As a core component of our framework, DO3D is a new motion disentanglement module that learns to predict camera ego-motion and instance-aware 3D object motion separately. To alleviate the difficulties in estimating non-rigid 3D object motions, they are decomposed into object-wise 6-DoF global transformations and a pixel-wise local 3D motion deformation field. Qualitative and quantitative experiments are conducted on three benchmark datasets, KITTI, Cityscapes, and VKITTI2, where our model delivers superior performance in all evaluated settings. For the depth estimation task, our model outperforms all compared research works in the high-resolution setting, attaining an absolute relative depth error (abs rel) of 0.099 on the KITTI benchmark. In addition, our optical flow estimation results (an overall EPE of 7.09 on KITTI) also surpass state-of-the-art methods and largely improve the estimation of dynamic regions, demonstrating the effectiveness of our motion model. Our code will be available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the estimation of depth and 3D object motion from monocular videos. Existing self-supervised depth estimation methods usually treat all objects in a video as static entities, which violates the dynamic nature of real-world scenes and fails to model the geometry and motion of moving objects. As a result, these methods perform poorly on dynamic objects. To address this, the authors propose a new self-supervised learning framework, DO3D (Decomposed Object-aware 3D Motion and Depth), with three main objectives:

1. **Jointly learning 3D motion and depth**: A new decomposed object-wise 3D motion estimation module (DO3D) predicts the camera ego-motion and the 3D motion of objects. The depth estimation network and the motion estimation network work together to model the geometry and dynamics of real-world scenes more faithfully, and their predictions are combined to synthesize a novel video frame for self-supervised training.
2. **Decomposing 3D object motion**: To alleviate the difficulty of estimating non-rigid 3D object motion, the DO3D module decomposes object motion into an object-wise global 6-DoF (six degrees of freedom) transformation and a pixel-wise local 3D motion deformation field. This decomposition lets the model handle non-rigid objects such as pedestrians and cyclists.
3. **Improving performance in dynamic scenes**: With these components, DO3D performs better on datasets that contain more dynamic objects, such as Cityscapes and VKITTI2. In the high-resolution setting, the model achieves an absolute relative depth error (abs rel) of 0.099 on the KITTI benchmark, outperforming all compared methods.
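The motion decomposition in item 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the instance mask, and the order of composition (ego-motion first, then the object-wise rigid transform, then the per-point residual) are assumptions made for illustration only.

```python
import numpy as np

def rigid_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def apply_transform(T, points):
    """Apply a 4x4 transform to an (N, 3) array of 3D points."""
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (homogeneous @ T.T)[:, :3]

def move_points(points, object_mask, T_ego, T_obj, local_deformation):
    """Hypothetical composition of the three motion components.

    Ego-motion is applied to every point. Points selected by object_mask
    additionally receive the object-wise 6-DoF transform and a per-point
    3D residual (the local deformation field).
    """
    moved = apply_transform(T_ego, points)
    on_object = apply_transform(T_obj, moved[object_mask])
    moved[object_mask] = on_object + local_deformation[object_mask]
    return moved

# Toy scene: one static background point and one point on a moving object.
points = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 5.0]])
object_mask = np.array([False, True])
T_ego = rigid_transform(np.eye(3), [0.0, 0.0, -1.0])   # camera moves forward
T_obj = rigid_transform(np.eye(3), [0.5, 0.0, 0.0])    # object moves sideways
local_deformation = np.array([[0.0, 0.0, 0.0], [0.0, 0.1, 0.0]])

moved = move_points(points, object_mask, T_ego, T_obj, local_deformation)
print(moved)
```

In this toy scene the background point is displaced by the ego-motion alone, while the object point also receives the sideways rigid motion and a small non-rigid offset. This mirrors the idea that the global transform captures most of an object's motion and the deformation field only has to model the remainder.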
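### Formula Representation

To make the working principle of DO3D clearer, here are some key formulas from the paper:

- **Back-projection and reprojection formula**:

  \[ d_s p_s = K T_{t \to s} (d_t K^{-1} p_t) \]

  where \(p_t\) and \(p_s\) are the 2D homogeneous pixel coordinates in the target frame and the source frame respectively, \(d_t\) and \(d_s\) are the corresponding depth values, \(T_{t \to s}\) is the camera extrinsic matrix describing the relative camera motion from the target frame to the source frame, and \(K\) is the camera intrinsic matrix. The term \(d_t K^{-1} p_t\) back-projects the target pixel to a 3D point, which is then transformed into the source camera frame and projected with \(K\).

- **Photometric loss function**:

  \[ L_{\text{ph}}(\hat{I}_t, I_t) = \alpha \frac{1 - \text{SSIM}(\hat{I}_t, I_t)}{2} + (1 - \alpha) \| \hat{I}_t - I_t \|_1 \]

  where \(I_t\) is the target frame, \(\hat{I}_t\) is the frame synthesized from the source frame, and \(\alpha\) is a hyperparameter that balances the SSIM term and the pixel-level difference.

- **Mathematical relationship between depth and pixel coordinates**:

  \[ u_s = \frac{d_t (u_t - c_x) + f_x t_1}{d_t + t_3} + c_x \]

  \[ v_s = \frac{d_t (v_t - c_y) + f_y t_2}{d_t + t_3} + c_y \]

  where \((u_t, v_t)\) and \((u_s, v_s)\) are the pixel coordinates in the target and source frames, \(f_x, f_y\) are the focal lengths and \((c_x, c_y)\) is the principal point from \(K\). These expressions follow from the first formula when the relative motion is a pure translation \((t_1, t_2, t_3)\) with no rotation.

Through these improvements, DO3D not only improves the accuracy of depth and motion estimation but also performs well in highly dynamic scenes.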
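The formulas above can be checked numerically with a small NumPy sketch. It makes several assumptions that do not come from the paper: single-channel images with values in [0, 1], a 3x3 box-filter SSIM with the commonly used constants `c1` and `c2`, `alpha = 0.85`, and illustrative intrinsics. The helper names `project`, `box_mean`, `ssim`, and `photometric_loss` are hypothetical.

```python
import numpy as np

def project(p_t, d_t, K, R, t):
    """Map target pixel (u, v) with depth d_t into the source view.

    Implements d_s * p_s = K (R * (d_t * K^-1 * p_t) + t) and returns
    the source pixel (u_s, v_s) together with its depth d_s.
    """
    p_h = np.array([p_t[0], p_t[1], 1.0])
    P_t = d_t * (np.linalg.inv(K) @ p_h)   # back-project to a 3D point
    P_s = R @ P_t + t                      # move into the source camera frame
    q = K @ P_s                            # equals d_s * p_s
    d_s = q[2]
    return q[:2] / d_s, d_s

def box_mean(x):
    """3x3 mean filter with reflect padding (assumed SSIM window)."""
    H, W = x.shape
    xp = np.pad(x, 1, mode="reflect")
    return sum(xp[i:i + H, j:j + W] for i in range(3) for j in range(3)) / 9.0

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM map computed from local means, variances, covariance."""
    mu_x, mu_y = box_mean(x), box_mean(y)
    var_x = box_mean(x * x) - mu_x ** 2
    var_y = box_mean(y * y) - mu_y ** 2
    cov = box_mean(x * y) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def photometric_loss(I_hat, I, alpha=0.85):
    """alpha * (1 - SSIM) / 2 + (1 - alpha) * |I_hat - I|, averaged over pixels."""
    ssim_term = (1.0 - ssim(I_hat, I)) / 2.0
    l1_term = np.abs(I_hat - I)
    return float(np.mean(alpha * ssim_term + (1 - alpha) * l1_term))

# Illustrative intrinsics and a pure translation (identity rotation).
K = np.array([[700.0, 0.0, 320.0], [0.0, 710.0, 240.0], [0.0, 0.0, 1.0]])
t = np.array([0.5, -0.2, 1.0])
u_t, v_t, d_t = 400.0, 200.0, 10.0

(u_s, v_s), d_s = project((u_t, v_t), d_t, K, np.eye(3), t)

# Closed-form expressions for u_s and v_s under pure translation.
u_closed = (d_t * (u_t - K[0, 2]) + K[0, 0] * t[0]) / (d_t + t[2]) + K[0, 2]
v_closed = (d_t * (v_t - K[1, 2]) + K[1, 1] * t[1]) / (d_t + t[2]) + K[1, 2]

rng = np.random.default_rng(0)
image = rng.random((16, 16))
loss_same = photometric_loss(image, image)
loss_shifted = photometric_loss(np.roll(image, 1, axis=1), image)
```

With identity rotation, the general projection and the closed-form expressions agree, and the source depth equals \(d_t + t_3\). The loss is zero for identical images and positive once the synthesized image is misaligned with the target, which is the signal that drives self-supervised training.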