A Review of Monocular Visual-Inertial SLAM
Zhang Guofeng, Huang Gan, Xie Weijian, Chen Danpeng, Wang Nan, Liu Haomin, Bao Hujun
DOI: https://doi.org/10.11834/jig.230863
2024-01-01
Abstract: Monocular visual-inertial simultaneous localization and mapping (VI-SLAM) is an important research topic in computer vision and robotics. It aims to estimate the pose (i.e., the position and orientation) of a device in real time using a monocular camera and an inertial sensor while constructing a map of the environment. With the rapid development of fields such as augmented/virtual reality (AR/VR), robotics, and autonomous driving, monocular VI-SLAM has received widespread attention owing to advantages such as low hardware cost and no need for external environment setup. Over the past decade or so, monocular VI-SLAM has made significant progress and spawned many excellent methods and systems. However, because of the complexity of real-world scenarios, different methods have also shown distinct limitations. Although some works have reviewed and evaluated VI-SLAM methods, most of them focus only on classic methods and thus cannot fully reflect the latest developments in VI-SLAM technology. Based on the optimization type, VI-SLAM methods can be divided into filtering-based and optimization-based methods. Filtering-based methods use filters to fuse observations from visual and inertial sensors, continuously updating the device's state for localization and mapping. Additionally, depending on whether visual data association (or feature matching) is performed separately, existing methods can be divided into indirect methods (or feature-based methods) and direct methods. Furthermore, with the development and widespread application of deep learning, researchers have started to incorporate deep learning into VI-SLAM to enhance robustness in extreme conditions or to perform dense reconstruction. This paper first elaborates on the basic principles of monocular VI-SLAM and then classifies and analyzes existing methods as filtering-based, optimization-based, feature-based, direct, and deep learning-based methods. However, most of the existing datasets and
benchmarks are focused on applications such as autonomous driving and drones, and mainly evaluate pose accuracy. Relatively few datasets have been specifically designed for AR. For a more comprehensive comparison of the advantages and disadvantages of different methods, we select three public datasets to quantitatively evaluate representative monocular VI-SLAM methods from multiple dimensions: the widely used EuRoC dataset, the ZJU-Sensetime dataset suitable for mobile-platform AR applications, and the low-cost and scalable framework to build localization benchmark (LSFB) dataset aimed at large-scale AR scenarios. We then supplement the ZJU-Sensetime dataset with a more challenging set of sequences, called sequences C, to enhance the variety of data types and evaluation dimensions. This extended dataset is designed to evaluate the robustness of algorithms under extreme conditions such as pure rotation, planar motion, lighting changes, and dynamic scenes. Specifically, sequences C comprise eight sequences, labeled C0-C7. In the C0 sequence, the handheld device moves around a room, performing multiple pure rotational motions. In the C1 sequence, the device is mounted on a stabilized gimbal and moves freely. In the C2 sequence, the device moves in a planar motion, maintaining a constant height. The C3 sequence includes lights being turned on and off during recording. In the C4 sequence, the device looks down at the floor while moving. The C5 sequence captures an exterior wall with significant parallax and minimal co-visibility, while the C6 sequence involves viewing a monitor during recording, with slight movement and changing screen content. Finally, the C7 sequence involves long-distance recording. On the EuRoC dataset, both filtering-based and optimization-based VI-SLAM methods achieved good accuracy. The multi-state constraint Kalman filter (MSCKF), an early filtering-based system, showed lower accuracy and struggled with some sequences. Some methods, such as OpenVINS and RNIN-VIO, enhanced accuracy by adding new features and
deep learning-based algorithms, respectively. OKVIS, an early optimization-based system, completed all sequences but with lower accuracy. Other methods, such as VINS-Mono, RD-VIO, and ORB-SLAM3, made significant improvements in initialization, robustness, and overall accuracy. Direct methods such as DM-VIO and SVO-Pro, which are extended from DSO and SVO, respectively, showed significant improvements in accuracy through techniques such as delayed marginalization and efficient use of texture information. Adaptive VIO, which is based on deep learning, achieved high accuracy by continuously updating through online learning, demonstrating adaptability to new scenarios. On the ZJU-Sensetime dataset, the comparison results of different methods are largely similar to those on EuRoC. The main difference is that the accuracy of the direct method DM-VIO decreases significantly when a rolling shutter camera is used, whereas the semidirect method SVO-Pro performs slightly better. Feature-based methods do not show a significant drop in accuracy, but the smaller field of view (FoV) of phone cameras reduces the robustness of ORB-SLAM3, Kimera, and MSCKF. Specifically, ORB-SLAM3 has high tracking accuracy but lower completeness, while Kimera and MSCKF show increased tracking errors. HybVIO, RNIN-VIO, and RD-VIO have the highest accuracy, with HybVIO slightly outperforming the other two. The deep learning-based Adaptive VIO also shows a significant drop in accuracy and struggles to complete sequences B and C, indicating generalization and robustness issues in complex scenarios. On the LSFB dataset, the comparison results are consistent with those on the small-scale datasets. The methods with the highest accuracy in small scenes, such as RNIN-VIO, HybVIO, and RD-VIO, continue to show high accuracy in large scenes. In particular, RNIN-VIO demonstrates even more significant accuracy advantages in large scenes. In large-scale scenes, many feature points are distant and lack parallax, leading to
rapid accumulation of errors, especially in methods that rely heavily on visual constraints. The neural inertial network-based RNIN-VIO can make better use of IMU observations, reducing its dependence on visual data. VINS-Mono also shows significant advantages in large scenes, as its sliding-window optimization facilitates the early inclusion of small-parallax feature points, effectively controlling error accumulation. In contrast, ORB-SLAM3, which relies on local maps, requires sufficient parallax before adding feature points to the local map, which can lead to insufficient visual constraints in distant environments and ultimately cause error accumulation and even tracking loss. The experimental results also show that optimization-based or combined filtering-optimization methods generally outperform filtering-based methods in terms of tracking accuracy and robustness. At the same time, direct/semidirect methods perform well with a global shutter camera but are prone to error accumulation when affected by rolling shutter and lighting changes, especially in large scenes. Combining deep learning can improve robustness in extreme situations. Finally, this work discusses the development trends and prospects of SLAM around three research hotspots: combining deep learning with V-SLAM/VI-SLAM, multisensor fusion, and end-cloud collaboration.