Self-Supervised Learning of Depth and Ego-Motion from Videos by Alternative Training and Geometric Constraints from 3-D to 2-D

Jiaojiao Fang, Guizhong Liu
DOI: https://doi.org/10.1109/tcds.2022.3152241
IF: 4.546
2022-01-01
IEEE Transactions on Cognitive and Developmental Systems
Abstract: Self-supervised learning of depth and ego-motion from unlabeled monocular videos has achieved promising results and drawn extensive attention. Most existing methods jointly train the depth and pose networks using the photometric consistency of adjacent views, based on the principle of structure-from-motion (SFM). However, coupling the depth and pose networks through scene reprojection seriously degrades learning performance, owing to the scale ambiguity of image reconstruction-based geometry learning or the error accumulation between the learning-based method and the multiview geometry-based method. In this article, we aim to improve depth and pose estimation without auxiliary tasks and to reduce the influence of these problems by alternately training each task with geometric constraints from 3-D to 2-D. Distinct from jointly training the depth and pose networks, our key idea is to better utilize the mutual dependency between the two tasks by alternately training each network with its respective geometric constraints while fixing the other. To make the optimization easier, epipolar geometric constraints embedded with iterative closest point (ICP)-based 3-D structural consistency are further introduced into the learning of the depth and pose networks, which takes full advantage of both geometric methods. In addition, a log-scale 3-D structural consistency loss is designed to put more emphasis on smaller depth values during training. Extensive experiments on various benchmark data sets indicate the superiority of our algorithm over state-of-the-art self-supervised methods.
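The abstract does not give the exact form of the log-scale 3-D structural consistency loss. As a rough illustration of why a log-scale discrepancy emphasizes smaller depth values, the sketch below compares the same 1 m absolute error at near and far range. The function name `log_scale_discrepancy` and the mean-absolute-log-difference form are assumptions made for this example, not the paper's definition.

```python
import math


def log_scale_discrepancy(depths_a, depths_b):
    """Mean absolute difference of log-depths.

    Hypothetical form chosen for illustration only; the paper's actual
    loss may differ. Depths must be positive.
    """
    if len(depths_a) != len(depths_b):
        raise ValueError("depth lists must have equal length")
    total = sum(abs(math.log(a) - math.log(b)) for a, b in zip(depths_a, depths_b))
    return total / len(depths_a)


# The same 1 m absolute error, once close to the camera and once far away.
near = log_scale_discrepancy([2.0], [3.0])    # |log 2 - log 3| = log 1.5
far = log_scale_discrepancy([50.0], [51.0])   # |log 50 - log 51| = log 1.02

print(near > far)
```

Under this assumed form, the penalty depends on the ratio of the two depths rather than their difference, so an error of fixed size costs more when the depths involved are small.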