DVN-SLAM: Dynamic Visual Neural SLAM Based on Local-Global Encoding

Wenhua Wu, Guangming Wang, Ting Deng, Sebastian Aegidius, Stuart Shanks, Valerio Modugno, Dimitrios Kanoulas, Hesheng Wang
2024-03-18
Abstract: Recent research on Simultaneous Localization and Mapping (SLAM) based on implicit representation has shown promising results in indoor environments. However, there are still some challenges: the limited scene representation capability of implicit encodings, the uncertainty in the rendering process from implicit representations, and the disruption of consistency by dynamic objects. To address these challenges, we propose a real-time dynamic visual SLAM system based on local-global fusion neural implicit representation, named DVN-SLAM. To improve the scene representation capability, we introduce a local-global fusion neural implicit representation that enables the construction of an implicit map while considering both global structure and local details. To tackle uncertainties arising from the rendering process, we design an information concentration loss for optimization, aiming to concentrate scene information on object surfaces. The proposed DVN-SLAM achieves competitive performance in localization and mapping across multiple datasets. More importantly, DVN-SLAM demonstrates robustness in dynamic scenes, a trait that sets it apart from other NeRF-based methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses challenges faced by Simultaneous Localization and Mapping (SLAM) systems based on implicit representations in indoor environments. The authors identify three main problems:

1. **Limited scene representation capability**:
   - Existing implicit encodings struggle to represent complex scenes. For example, iMAP [1] uses a neural implicit representation with positional encoding. It achieves global consistency, but it over-smooths local details and tends to forget details as the scene scale grows.
   - Methods based on feature grids or planes (such as NICE-SLAM [2] and ESLAM [4]) model local scene details accurately, but their global representation and prediction capabilities degrade significantly.
2. **Uncertainty in the rendering process**:
   - In volume rendering from an implicit representation, different distributions of information along the same view ray can produce the same rendered result, which introduces uncertainty. Even when the rendering error is small, the underlying distribution of scene information may be inaccurate.
3. **Disruption of consistency by dynamic objects**:
   - Moving objects break the static consistency of the scene, so pure-pose implicit mapping is insufficient for modeling dynamic scenes. Existing NeRF-based SLAM methods handle dynamic scenes poorly and are easily disturbed by dynamic objects, which leads to localization and mapping failures.

To solve these problems, the authors propose a real-time dynamic visual SLAM system based on a local-global fusion neural implicit representation, named DVN-SLAM.
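The rendering ambiguity described in item 2 can be made concrete with a small sketch. It assumes per-sample rendering weights that sum to one along a ray and renders depth as the weighted sum of sample depths. The function name `render_depth` and all numeric values are illustrative assumptions, not taken from the paper.

```python
def render_depth(depths, weights):
    """Rendered depth along one ray: sum_i w_i * d_i.

    Assumption: the weights are non-negative and sum to 1.
    """
    return sum(w * d for w, d in zip(weights, depths))


# Sample depths along a single ray, in metres (illustrative values).
depths = [1.0, 2.0, 3.0]

# Distribution A: all weight sits on the sample at 2 m (a sharp surface).
weights_sharp = [0.0, 1.0, 0.0]

# Distribution B: weight is split between samples in front of and behind 2 m.
weights_spread = [0.5, 0.0, 0.5]

depth_sharp = render_depth(depths, weights_sharp)
depth_spread = render_depth(depths, weights_spread)

print(depth_sharp, depth_spread)
```

Both rays render the same depth, so a loss on the rendered value alone cannot tell them apart, even though only the first distribution places the scene information on a single surface.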
The main innovations of this system include:

- **Local-global fusion neural implicit representation**: By combining attention-based feature fusion with result fusion, the system uses continuous neural radiance fields for global representation and discrete feature planes for local representation, which improves scene representation capability.
- **Information concentration loss**: To address the uncertainty in the rendering process, an information concentration loss based on rendering variance is designed to optimize the distribution of scene information so that it concentrates on object surfaces.
- **Robustness in dynamic scenes**: DVN-SLAM performs well in dynamic scenes: it automatically ignores fast-moving objects and effectively recovers the background occluded by dynamic objects.

These improvements make DVN-SLAM competitive in static scenes while maintaining effective localization and mapping performance in highly dynamic scenes.
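A variance-based concentration term can be sketched as follows. This summary does not give the paper's exact formulation, so the form here is an assumption: the loss is taken to be the weighted variance of sample depths along each ray, averaged over rays. The names `depth_mean_and_variance` and `concentration_loss` and the numeric values are hypothetical.

```python
def depth_mean_and_variance(depths, weights):
    """Weighted mean and variance of sample depths along one ray.

    Assumption: the weights are non-negative and sum to 1.
    """
    mean = sum(w * d for w, d in zip(weights, depths))
    variance = sum(w * (d - mean) ** 2 for w, d in zip(weights, depths))
    return mean, variance


def concentration_loss(rays):
    """Average per-ray depth variance over a list of (depths, weights) rays."""
    variances = [depth_mean_and_variance(d, w)[1] for d, w in rays]
    return sum(variances) / len(variances)


# Two rays that share the same rendered depth (2 m) but distribute
# their weights differently along the ray (illustrative values).
depths = [1.0, 2.0, 3.0]
sharp_ray = (depths, [0.0, 1.0, 0.0])   # weight concentrated on one sample
spread_ray = (depths, [0.5, 0.0, 0.5])  # weight spread around the surface

loss_sharp = concentration_loss([sharp_ray])
loss_spread = concentration_loss([spread_ray])

print(loss_sharp, loss_spread)
```

Under this assumed form, the concentrated ray incurs zero loss and the spread ray is penalized, so minimizing the term pushes weight toward a single depth per ray. That matches the stated goal of concentrating scene information on object surfaces.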