Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor

Xinyang Liu, Yijin Li, Yanbin Teng, Hujun Bao, Guofeng Zhang, Yinda Zhang, Zhaopeng Cui
DOI: https://doi.org/10.48550/arXiv.2308.14383
2023-08-28
Abstract: Light-weight time-of-flight (ToF) depth sensors are compact and cost-efficient, and thus widely used on mobile devices for tasks such as autofocus and obstacle detection. However, due to the sparse and noisy depth measurements, these sensors have rarely been considered for dense geometry reconstruction. In this work, we present the first dense SLAM system with a monocular camera and a light-weight ToF sensor. Specifically, we propose a multi-modal implicit scene representation that supports rendering both the signals from the RGB camera and light-weight ToF sensor which drives the optimization by comparing with the raw sensor inputs. Moreover, in order to guarantee successful pose tracking and reconstruction, we exploit a predicted depth as an intermediate supervision and develop a coarse-to-fine optimization strategy for efficient learning of the implicit representation. At last, the temporal information is explicitly exploited to deal with the noisy signals from light-weight ToF sensors to improve the accuracy and robustness of the system. Experiments demonstrate that our system well exploits the signals of light-weight ToF sensors and achieves competitive results both on camera tracking and dense scene reconstruction. Project page: https://zju3dv.github.io/tof_slam/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenge of performing dense simultaneous localization and mapping (dense SLAM) with a monocular camera and a lightweight time-of-flight (ToF) sensor. Because of their compact design, lightweight ToF sensors provide only low-resolution depth distribution measurements, which existing RGB-D dense SLAM systems cannot use directly. The main difficulties are:

1. **Low-resolution depth signal**: The depth signals from lightweight ToF sensors are sparse and noisy, so they are hard to use directly for high-precision scene reconstruction and camera pose estimation.
2. **Limitations of existing systems**: Existing RGB-D SLAM systems usually require high-resolution depth images as input, which lightweight ToF sensors cannot provide.
3. **Multi-modal signal fusion**: The RGB images from the monocular camera and the depth signals from the lightweight ToF sensor must be fused effectively to achieve accurate camera tracking and dense scene reconstruction.

To solve these problems, the paper proposes a multi-modal implicit scene representation that handles RGB images and lightweight ToF signals together, supported by the following techniques:

- **Multi-modal implicit scene representation**: The representation supports rendering both RGB images and lightweight ToF signals, so the optimization can compare renderings against the raw sensor inputs.
- **Predicted depth as intermediate supervision**: A depth prediction model generates high-resolution depth maps that serve as intermediate supervision, improving the robustness and accuracy of the system.
- **Coarse-to-fine optimization strategy**: Region-level ToF signals are used first for a coarse optimization, and pixel-level RGB/depth supervision is then added to recover geometric details.
- **Temporal filtering technique**: Historical and current observations are fused to strengthen the depth prediction module, especially when the quality of the raw ToF signals is poor.

With these techniques, the paper performs dense SLAM using only a monocular camera and a lightweight ToF sensor, and its experiments report competitive results in both camera tracking and scene reconstruction.
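As a rough illustration of the coarse-to-fine idea summarized above, the NumPy sketch below pools a rendered dense depth map into region-level statistics and compares them with ToF measurements first, then adds pixel-level RGB and predicted-depth terms after a fixed number of steps. This is not the authors' implementation: the 8x8 zone grid, the mean/variance summary of each zone, the L1 losses, the `coarse_steps` threshold, and all function names are assumptions made for illustration.

```python
import numpy as np


def zone_stats(depth, zones=8):
    """Pool a dense depth map (H, W) into per-zone mean and variance.

    The zones x zones grid is a hypothetical stand-in for the
    region-level layout of a lightweight ToF sensor.
    """
    h, w = depth.shape
    assert h % zones == 0 and w % zones == 0, "sketch assumes evenly divisible zones"
    # Split rows and columns into blocks: (zone_row, in_row, zone_col, in_col).
    blocks = depth.reshape(zones, h // zones, zones, w // zones)
    return blocks.mean(axis=(1, 3)), blocks.var(axis=(1, 3))


def tof_loss(rendered_depth, tof_mean, tof_var, zones=8):
    """Region-level L1 loss between pooled rendered depth and ToF statistics."""
    mean, var = zone_stats(rendered_depth, zones)
    return np.abs(mean - tof_mean).mean() + np.abs(var - tof_var).mean()


def total_loss(rendered_depth, rendered_rgb, tof_mean, tof_var,
               pred_depth, rgb, step, coarse_steps=100):
    """Coarse stage: ToF term only. Fine stage: add pixel-level terms."""
    loss = tof_loss(rendered_depth, tof_mean, tof_var)
    if step >= coarse_steps:
        # Pixel-level supervision from the RGB image and the predicted depth.
        loss += np.abs(rendered_rgb - rgb).mean()
        loss += np.abs(rendered_depth - pred_depth).mean()
    return loss
```

In a real system these terms would be evaluated on rays sampled from the implicit representation and minimized with an automatic-differentiation framework; the sketch only shows how region-level and pixel-level supervision could be staged.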