Joint Self-Supervised Learning of Interest Point, Descriptor, Depth, and Ego-Motion from Monocular Video
Zhongyi Wang, Mengjiao Shen, Qijun Chen
DOI: https://doi.org/10.1007/s11042-024-18382-x
IF: 2.577
2024-01-01
Multimedia Tools and Applications
Abstract: This paper addresses the self-supervised learning of four low-level vision tasks that are critical to Visual Simultaneous Localization and Mapping (VSLAM): interest point learning, descriptor learning, ego-motion estimation, and depth estimation. Our key insight is that appearance and geometry constraints can be used to couple these fundamental vision problems. We propose a self-supervised framework that jointly trains neural networks for multiple objectives, which addresses complicated issues, simplifies the system, and provides important information for deep monocular VSLAM systems. First, we feed two adjacent images into pose and depth networks to obtain their corresponding depth maps and camera poses. Then, a differentiable geometry module uses the depth maps and camera poses to generate the pseudo-input images required by the interest point network and to construct the geometry loss. Next, we feed the pseudo-input image and the source image into the interest point network to obtain the corresponding interest points, descriptors, and scores, from which we construct the appearance loss. Finally, we combine the geometry and appearance losses to constrain the whole network in an unsupervised manner. The novelty of this paper is that it integrates the key information required by monocular VSLAM into a unified framework that handles interest point learning, descriptor learning, ego-motion estimation, and depth estimation simultaneously. Without any ground truth, our model learns these sub-problems jointly in a self-supervised manner and achieves state-of-the-art performance in each of their respective domains.
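The geometry step described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the warping formula or the loss definition, so the code below assumes the inverse-warping view synthesis commonly used in self-supervised depth estimation and a plain L1 photometric error as the geometry loss. The function names (`reproject`, `bilinear_sample`, `photometric_loss`), the pinhole intrinsics `K`, and the target-to-source pose `T` are hypothetical and chosen for illustration only; a real training pipeline would use a differentiable framework rather than NumPy.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every target pixel to source-image coordinates (assumed pinhole model).

    depth: (H, W) depth of the target frame
    K:     (3, 3) camera intrinsics
    T:     (4, 4) rigid transform from target camera to source camera
    Returns (2, H, W) array of (x, y) coordinates in the source image.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)  # back-project to 3D
    cam_src = T[:3, :3] @ cam + T[:3, 3:4]                 # move into source frame
    proj = K @ cam_src                                     # project to source pixels
    uv = proj[:2] / np.clip(proj[2:3], 1e-6, None)
    return uv.reshape(2, H, W)

def bilinear_sample(img, uv):
    """Sample a grayscale (H, W) image at fractional coordinates uv (2, H, W)."""
    H, W = img.shape
    x, y = uv[0], uv[1]
    valid = (x >= 0) & (x <= W - 1) & (y >= 0) & (y <= H - 1)
    x = np.clip(x, 0, W - 1)
    y = np.clip(y, 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    out = (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x1] * wx * (1 - wy)
           + img[y1, x0] * (1 - wx) * wy + img[y1, x1] * wx * wy)
    return out, valid

def photometric_loss(target, warped, valid):
    """Mean L1 difference over pixels that project inside the source image."""
    return np.abs(target - warped)[valid].mean()

# Usage: synthesize a pseudo-input image from a source frame (illustrative values).
K = np.array([[100.0, 0.0, 4.0], [0.0, 100.0, 3.0], [0.0, 0.0, 1.0]])
depth = np.full((6, 8), 2.0)           # stand-in for a depth-network output
T = np.eye(4)
T[0, 3] = 0.02                         # stand-in for a pose-network output
source = np.tile(np.arange(8, dtype=np.float64), (6, 1))
pseudo, valid = bilinear_sample(source, reproject(depth, K, T))
```

In this sketch the warped image `pseudo` plays the role of the pseudo-input image, and the error between it and the real target frame supplies a geometry-style training signal for the depth and pose estimates. How the paper weights this term against the appearance loss is not stated in the abstract.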