Abstract: Although cluttered indoor scenes contain a lot of useful high-level semantic information that can be used for mapping and localization, most visual odometry (VO) algorithms rely on geometric features such as points, lines, and planes. Lately, driven by this idea, the joint optimization of semantic labels and odometry estimates has gained popularity in the robotics community. This joint optimization is accurate but generally very slow. At the same time, in the vision community, direct and sparse approaches to VO have struck the right balance between speed and accuracy. We merge the successes of these two communities and present a preprocessing method that incorporates semantic information, in the form of visual saliency, into direct sparse odometry (DSO), a highly successful direct sparse VO algorithm. We also present a framework to filter the visual saliency based on scene parsing. Our framework, SalientDSO, relies on widely successful deep learning-based approaches for visual saliency and scene parsing, which drive the feature selection to obtain highly accurate and robust VO even with as few as 40 point features per frame. We provide an extensive quantitative evaluation of SalientDSO on the ICL-NUIM and TUM monoVO data sets and show that we outperform DSO and ORB-SLAM, two very popular state-of-the-art approaches in the literature. We also collect and publicly release the CVL-UMD data set, which contains two cluttered indoor sequences on which we show qualitative evaluations. To the best of our knowledge, this is the first paper to use visual saliency and scene parsing to drive feature selection in direct VO.

Note to Practitioners: The problem of estimating camera motion from a sequence of frames captured by a moving camera is commonly called VO. It has many applications, such as building a 3-D map of the scene that a robot can use to navigate, grasp objects, and so on.
Any VO algorithm must be fast, robust, and low in drift (that is, it must accumulate little error over time). These desired properties are generally obtained by selecting “good” features in an image, which, in the computer vision sense, turn out to be “corners.” However, when we restrict the setting to an indoor scene with a lot of clutter, there are many objects that can provide “good” features in both the computer vision sense and a conceptual sense. We use this philosophy and present a preprocessing method that selects better features than a traditional VO pipeline using only geometric features. It improves the robustness of a state-of-the-art VO method, direct sparse odometry, and obtains more accurate and robust results even with fewer features. We evaluate our method on three different data sets: ICL-NUIM, TUM monoVO, and CVL-UMD. We collected the custom CVL-UMD data set to demonstrate the robustness of our approach, SalientDSO, in cluttered indoor scenes.
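The idea of letting saliency drive feature selection can be illustrated with a small sketch. This is not the SalientDSO implementation: the function name `select_salient_points`, the weighting (gradient magnitude multiplied by saliency), and the random sampling scheme are all assumptions chosen for illustration, and the saliency map is taken as a given input rather than predicted by a network.

```python
import numpy as np

def select_salient_points(image, saliency, num_points=40, seed=0):
    """Sample pixel locations, favoring high-gradient pixels in salient regions.

    Hypothetical helper for illustration only; not the SalientDSO pipeline.
    image:    2-D grayscale array.
    saliency: 2-D non-negative array with the same shape as image.
    Returns an array of (row, col) pairs with at most num_points rows.
    """
    rng = np.random.default_rng(seed)

    # Direct methods need pixels with intensity change, so start from
    # the image gradient magnitude.
    gy, gx = np.gradient(image.astype(np.float64))
    grad_mag = np.hypot(gx, gy)

    # Assumed weighting: scale gradient magnitude by saliency, so that
    # non-salient pixels (saliency == 0) are never chosen.
    weights = grad_mag * saliency.astype(np.float64)
    total = weights.sum()
    if total <= 0:
        raise ValueError("no pixel has both gradient and saliency")
    probs = (weights / total).ravel()

    # We cannot draw more distinct pixels than have nonzero weight.
    k = min(num_points, int(np.count_nonzero(probs)))
    flat_idx = rng.choice(probs.size, size=k, replace=False, p=probs)

    rows, cols = np.unravel_index(flat_idx, image.shape)
    return np.stack([rows, cols], axis=1)
```

With a saliency map that is nonzero only over an object of interest, every returned point lies on that object, which mirrors the abstract's claim that a small, well-chosen set of points (as few as 40 per frame) can be enough.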

Salient Sparse Visual Odometry With Pose-Only Supervision

Self-supervised Visual-LiDAR Odometry with Flip Consistency

Design of an Enhanced Visual Odometry by Building and Matching Compressive Panoramic Landmarks Online

PALVO: Visual Odometry Based on Panoramic Annular Lens

PVO: Panoptic Visual Odometry

Learning Generalized Visual Odometry Using Position-Aware Optical Flow and Geometric Bundle Adjustment

A self-supervised monocular odometry with visual-inertial and depth representations

Unsupervised Monocular Visual-Inertial Odometry Network

XVO: Generalized Visual Odometry via Cross-Modal Self-Training

Self-Improving Visual Odometry

DF-VO: What Should Be Learnt for Visual Odometry?

DeepAVO: Efficient Pose Refining with Feature Distilling for Deep Visual Odometry

Pose Refinement: Bridging the Gap Between Unsupervised Learning and Geometric Methods for Visual Odometry

Self-Supervised Deep Visual Odometry with Online Adaptation

MAS-DSO: Advancing Direct Sparse Odometry with Multi-Attention Saliency

SalientDSO: Bringing Attention to Direct Sparse Odometry

Learning By Analogy: Reliable Supervision From Transformations For Unsupervised Optical Flow Estimation

Robust and Efficient Visual-Inertial Odometry with Multi-plane Priors

Visual Odometry Based On Semantic Supervision

Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World

Leveraging Deep Learning for Visual Odometry Using Optical Flow