Abstract: Estimating geometric elements such as depth, camera motion, and optical flow from images is an important part of a robot's visual perception. We use a joint self-supervised method to estimate these three geometric elements. The depth network, optical flow network, and camera motion network are independent of each other but are jointly optimized during the training phase. Compared with independent training, joint training can make full use of the geometric relationships between the geometric elements and provide dynamic and static information about the scene. In this paper, we improve the joint self-supervised method in three respects: network structure, dynamic object segmentation, and geometric constraints. In terms of network structure, we apply the attention mechanism to the camera motion network, which helps to take advantage of the similarity of camera movement between frames. Following the attention mechanism in the Transformer, we also propose a plug-and-play convolutional attention module. In terms of dynamic objects, because dynamic objects affect the optical flow self-supervised framework and the depth-pose self-supervised framework differently, we propose a threshold algorithm to detect dynamic regions and mask them in the respective loss functions. In terms of geometric constraints, we use traditional methods to estimate the fundamental matrix from corresponding points and use it to constrain the camera motion network. We demonstrate the effectiveness of our method on the KITTI dataset. Compared with other joint self-supervised methods, our method achieves state-of-the-art performance in pose and optical flow estimation, and competitive results in depth estimation. Code will be available at https://github.com/jianfenglihg/Unsupervised_geometry.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to jointly learn depth, optical flow and ego-motion in an unsupervised manner from videos. Specifically, the paper proposes a new self-supervised method, aiming to improve the estimation accuracy of depth, optical flow and camera motion by optimizing the network structure, dynamic object segmentation and geometric constraints. This method can fully utilize the relationships between geometric elements and provide dynamic and static information of the scene, thus playing an important role in robotic visual perception.

### Main Contributions

1. **Introduction of Attention Mechanism**: An attention mechanism is introduced into the pose-estimation network, and a convolutional attention module is proposed to explore the continuity and similarity of inter-frame motion, thereby improving the accuracy of pose estimation.
2. **New Geometric Constraints**: The traditional eight-point method is used to establish geometric constraints between optical flow and pose, improving the estimation accuracy.
3. **Optical Flow Direction Consistency Constraint**: A new optical flow direction consistency constraint is proposed, effectively improving the accuracy of optical flow prediction.
4. **Dynamic Object Detection Method**: A method for detecting moving objects is proposed, effectively removing the influence of moving objects from the loss function.

### Method Overview

The overall method in the paper is shown in Figure 1 and mainly relies on three independent neural networks: DepthNet, FlowNet and PoseNet. During the training phase, three adjacent frames of images \(I_{t-1}, I_t, I_{t+1}\) are used to obtain the estimation results of depth, optical flow and pose respectively. Then various types of masks are calculated, inappropriate areas are eliminated, and unsupervised training is carried out through the loss function.
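To make the eight-point constraint between optical flow and pose more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the normalized (Hartley) variant of the eight-point algorithm, treats flow-displaced pixels as correspondences, and compares the result with the fundamental matrix implied by a pose \((R, t)\) under the convention \(P_{t+1} = R P_t + t\). All function names, the synthetic scene, and the intrinsics `K` are hypothetical and chosen only for illustration.

```python
import numpy as np

def _normalize(pts):
    """Hartley normalization: centre the points, scale mean distance to sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return pts_h @ T.T, T

def eight_point(p1, p2):
    """Estimate F with x2^T F x1 = 0 from N >= 8 pixel correspondences of shape (N, 2)."""
    x1, T1 = _normalize(p1)
    x2, T2 = _normalize(p2)
    # Each row holds the coefficients of the nine entries of F (row-major).
    A = np.stack([np.kron(b, a) for a, b in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    F = U @ np.diag(S) @ Vt
    F = T2.T @ F @ T1  # undo the normalization
    return F / np.linalg.norm(F)

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K, R, t):
    """F = K^{-T} [t]_x R K^{-1}, scaled to unit Frobenius norm."""
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ skew(t) @ R @ K_inv
    return F / np.linalg.norm(F)

def fundamental_distance(F_a, F_b):
    """Frobenius distance between unit-norm matrices, ignoring the sign ambiguity."""
    return min(np.linalg.norm(F_a - F_b), np.linalg.norm(F_a + F_b))

def project(K, P):
    """Pinhole projection of 3D points (N, 3) to pixels (N, 2)."""
    q = P @ K.T
    return q[:, :2] / q[:, 2:3]

# Synthetic static scene (hypothetical values) observed from two camera poses.
rng = np.random.default_rng(0)
P1 = rng.uniform([-2.0, -1.0, 4.0], [2.0, 1.0, 10.0], size=(50, 3))
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
angle = 0.05  # small rotation about the y axis, in radians
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.1, 0.0, 0.5])
p1 = project(K, P1)
flow = project(K, P1 @ R.T + t) - p1  # stands in for the FlowNet output

F_flow = eight_point(p1, p1 + flow)
F_pose = fundamental_from_pose(K, R, t)
residual = fundamental_distance(F_flow, F_pose)
print("distance between flow-based and pose-based F:", residual)
```

With noise-free correspondences from a static scene the two matrices should agree up to scale and sign, so the distance is close to zero. In training, a discrepancy of this kind could act as a penalty on the pose prediction; the exact form of the paper's geometric loss is not given in this summary.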
### Loss Function

Four types of loss functions are defined in the paper:

- **Photometric Reprojection Loss \(L_{ph}\)**: Used to measure the image reconstruction error.
- **Smoothness Loss \(L_s\)**: Used to regularize the smoothness of geometric entities.
- **Consistency Loss \(L_c\)**: Including the consistency loss of the depth map, the consistency loss between depth and optical flow, and the consistency loss of forward and backward optical flow.
- **Geometric Loss \(L_g\)**: Based on the difference between the fundamental matrix estimated by the eight-point method and the fundamental matrix calculated by the pose.

The total loss function is:

\[L=\lambda_d^{ph}M_vM_oM_dL_d^{ph}+\lambda_f^{ph}M_vM_oL_f^{ph}+\lambda_d^cM_vM_oM_dL_d^c+\lambda_f^cM_dL_f^c+\lambda_{df}^cM_vM_oM_dL_{df}^c+\lambda_d^sL_d^s+\lambda_f^sL_f^s+\lambda_gM_vM_oM_dL_g\]

### Geometric and Appearance Basis

- **Pixel Correspondence**: Pixel correspondences between image frames are established through geometric elements estimated by the network, and images are reconstructed by interpolation.
- **3D Scene Point Back-Projection**: Given a pixel \(p\) in image \(I_t\), it can be back-projected to a 3D scene point \(P_t\) in the camera coordinate system:
  \[P_t = D_t(p)K^{-1}p\]
- **Coordinate Transformation**: According to the camera motion \(R_f, t_f\), the 3D scene point is transformed from time \(t\) to time \(t+1\) and projected onto image \(I_{t+1}\):
  \[p_{t+1}^{d}=K[R_f|t_f]D_t(p)K^{-1}p\]
- **Optical Flow Calculation**: For dynamic objects, optical flow \(F_f(p)\) is used to establish pixel correspondences:
  \[p_{t+1}^f = p+F_f(p)\]

### Masks

- **Validity Mask \(M_v\)**: Analytically calculated from depth and ego-motion estimates.
- **Occlusion Mask \(M_o\)**: Detects occluded areas based on the difference in forward and backward reconstruction errors.
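The back-projection, coordinate transformation, and optical flow correspondences listed under Geometric and Appearance Basis can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the paper's code. The projection includes the perspective division that the compact formula for \(p_{t+1}^{d}\) leaves implicit. The in-bounds check is one common way to define a validity mask, and thresholding the gap between \(p_{t+1}^{d}\) and \(p_{t+1}^{f}\) with a hypothetical `tau` is only one plausible way to flag dynamic regions; neither is claimed to match the paper's definitions. The scene, intrinsics, and flow field are synthetic.

```python
import numpy as np

def pixel_grid(h, w):
    """Homogeneous pixel coordinates (u, v, 1) as a (3, H*W) array."""
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    return np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)

def backproject(depth, K):
    """P_t = D_t(p) K^{-1} p for every pixel; returns (3, H*W)."""
    p = pixel_grid(*depth.shape)
    return depth.reshape(1, -1) * (np.linalg.inv(K) @ p)

def reproject(depth, K, R, t):
    """p^d_{t+1}: transform P_t by (R, t), project with K, divide by depth."""
    h, w = depth.shape
    P_next = R @ backproject(depth, K) + t.reshape(3, 1)
    q = K @ P_next
    pix = (q[:2] / q[2:3]).T.reshape(h, w, 2)
    z = q[2].reshape(h, w)
    return pix, z

# Hypothetical example: a fronto-parallel plane 5 units away, camera shifts along x.
h, w = 48, 64
depth = np.full((h, w), 5.0)
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.5, 0.0, 0.0])

p = pixel_grid(h, w)[:2].T.reshape(h, w, 2)  # pixel coordinates p
p_d, z_next = reproject(depth, K, R, t)      # correspondences from depth and pose
rigid_flow = p_d - p                         # flow explained by camera motion alone

# Assumed validity mask: reprojected pixel lies inside the image with positive depth.
valid = ((p_d[..., 0] >= 0) & (p_d[..., 0] <= w - 1) &
         (p_d[..., 1] >= 0) & (p_d[..., 1] <= h - 1) & (z_next > 0))

# Synthetic "predicted" flow: rigid flow plus an independently moving patch.
flow = rigid_flow.copy()
flow[10:20, 10:20, 0] += 3.0
p_f = p + flow                               # p^f_{t+1} = p + F_f(p)

# Gap between the two correspondences; large values hint at non-rigid motion.
discrepancy = np.linalg.norm(p_d - p_f, axis=-1)
tau = 1.0                                    # hypothetical threshold in pixels
dynamic = discrepancy > tau
print("dynamic pixels flagged:", int(dynamic.sum()))
```

In a static scene the two correspondences coincide, so the discrepancy is near zero everywhere except where an object moves on its own. A mask of this kind is what would multiply the depth-pose loss terms so that moving objects do not corrupt them.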

Unsupervised Joint Learning of Depth, Optical Flow, Ego-motion from Video
