MM3DGS SLAM: Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial Measurements

Lisong C. Sun,Neel P. Bhatt,Jonathan C. Liu,Zhiwen Fan,Zhangyang Wang,Todd E. Humphreys,Ufuk Topcu
2024-04-01
Abstract:Simultaneous localization and mapping is essential for position tracking and scene understanding. 3D Gaussian-based map representations enable photorealistic reconstruction and real-time rendering of scenes using multiple posed cameras. We show for the first time that using 3D Gaussians for map representation with unposed camera images and inertial measurements can enable accurate SLAM. Our method, MM3DGS, addresses the limitations of prior neural radiance field-based representations by enabling faster rendering, scale awareness, and improved trajectory tracking. Our framework enables keyframe-based mapping and tracking utilizing loss functions that incorporate relative pose transformations from pre-integrated inertial measurements, depth estimates, and measures of photometric rendering quality. We also release a multi-modal dataset, UT-MM, collected from a mobile robot equipped with a camera and an inertial measurement unit. Experimental evaluation on several scenes from the dataset shows that MM3DGS achieves 3x improvement in tracking and 5% improvement in photometric rendering quality compared to the current 3DGS SLAM state-of-the-art, while allowing real-time rendering of a high-resolution dense 3D map. Project Webpage:
Computer Vision and Pattern Recognition,Artificial Intelligence,Robotics
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to use multi - modal sensor data (vision, depth, and inertial measurement) to achieve efficient, real - time, and high - quality 3D scene reconstruction and camera pose tracking in the Simultaneous Localization and Mapping (SLAM) task. Specifically, the paper proposes a new method named MM3DGS, aiming to improve the existing SLAM techniques through the following aspects: 1. **Multi - modal data fusion**: Combine the data of monocular or RGB - D camera images and Inertial Measurement Unit (IMU) to improve the robustness and accuracy of the SLAM system. This solves the limitations that may be encountered when only using visual or depth information, such as motion blur, exposure change, and long - distance depth estimation error. 2. **3D Gaussian representation**: Use 3D Gaussian distribution to represent the scene. This method not only supports real - time rendering but also can provide more accurate scale perception and trajectory alignment ability. Compared with the traditional sparse point cloud or Neural Radiance Field (NeRF) representation, the 3D Gaussian representation significantly reduces the computational cost while maintaining high visual quality. 3. **Real - time performance**: By optimizing key - frame selection, Gaussian initialization, and mapping process, ensure that the system can achieve real - time processing on consumer - level hardware. This is crucial for application scenarios that require a quick response, such as drone mapping, augmented reality, and autonomous mobile robots. 4. **Dataset release**: To verify the effectiveness of the proposed method, the author also releases a new multi - modal dataset (UT - MM), which contains RGB images, depth maps, IMU measurements, and LiDAR point cloud data of multiple indoor scenes collected from mobile robots. This dataset provides a valuable resource for subsequent research. In summary, the main goal of this paper is to develop an efficient, real - time, and high - quality SLAM solution by introducing multi - modal sensor data and 3D Gaussian representation, thereby enhancing the performance of existing techniques in complex environments.