SplatR: Experience Goal Visual Rearrangement with 3D Gaussian Splatting and Dense Feature Matching

Arjun P S, Andrew Melnik, Gora Chand Nandi
2024-11-22
Abstract: The Experience Goal Visual Rearrangement task stands as a foundational challenge within Embodied AI, requiring an agent to construct a robust world model that accurately captures the goal state. The agent uses this world model to restore a shuffled scene to its original configuration, making an accurate representation of the world essential for successfully completing the task. In this work, we present a novel framework that leverages 3D Gaussian Splatting as a 3D scene representation for the experience goal visual rearrangement task. Recent advances in volumetric scene representation, such as 3D Gaussian Splatting, offer fast rendering of high-quality, photo-realistic novel views. Our approach gives the agent consistent views of the current and goal settings of the rearrangement task, which lets it directly compare the goal state and the shuffled state of the world in image space. To compare these views, we propose a dense feature matching method that uses visual features extracted from a foundation model, leveraging its more universal feature representation, which facilitates robustness and generalization. We validate our approach on the AI2-THOR rearrangement challenge benchmark and demonstrate improvements over the current state-of-the-art methods.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the key challenges of the **Experience Goal Visual Rearrangement Task**. Specifically, it aims to develop a method that accurately captures the object configuration of a scene, so that the agent can restore the environment to its goal state after the scene has been shuffled.

### Problem Background

In Embodied AI, the visual rearrangement task requires an agent to navigate a complex environment, recognize its current state, and restore it to a specified goal state by manipulating objects. In the experience goal setting, the agent is first placed in the environment in its goal state and builds a world model of that state. The agent is then re-initialized in the same environment after it has been shuffled and must rearrange the objects to match the goal state it observed earlier.

### Main Challenges

1. **Effectively Representing the Goal State**: When first exposed to the goal state, the agent needs to build a world model that accurately represents the positions and poses of all objects.
2. **Comparing the Current and Goal States**: The agent must compare the current shuffled state with the goal state held in memory and identify which objects have changed.
3. **Robustness and Generalization Ability**: The method needs to be robust enough to handle varied, complex real-world situations and to perform well on different datasets.

### Solutions

To address these challenges, the paper proposes the SplatR framework. Its main innovations are:

- **Using 3D Gaussian Splatting as Scene Representation**: This representation renders high-quality, photo-realistic novel views quickly and provides a continuous scene representation that retains rich visual information.
- **Dense Feature Matching**: Local features extracted from a foundation model (such as DINOv2) are matched patch by patch, which compares the two images at the level of semantic content and so detects scene changes more accurately.
- **Category-Agnostic Object Matching**: Visual embeddings are used for object-level matching, avoiding the errors that classification-based methods can introduce.

With these techniques, SplatR performs well on the AI2-THOR rearrangement benchmark, outperforming existing methods on metrics such as % Fixed Strict, % Misplaced, and % Energy Remaining.

### Summary

The core contribution of this paper is a novel framework that combines 3D Gaussian Splatting with dense feature matching to solve the experience goal visual rearrangement task, improving the agent's rearrangement performance over existing methods on the AI2-THOR benchmark.
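
The dense feature matching and category-agnostic object matching steps listed under Solutions can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes patch features (for example from DINOv2) and per-object embeddings have already been extracted into arrays. The function names, the thresholds, and the mutual-nearest-neighbour rule are hypothetical choices made only for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale feature vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def patch_change_mask(goal_feats: np.ndarray,
                      current_feats: np.ndarray,
                      threshold: float = 0.5):
    """Compare two (H, W, D) patch-feature grids position by position.

    Returns the per-patch cosine similarity and a boolean mask that is True
    where the similarity falls below `threshold` (an assumed cut-off).
    """
    if goal_feats.shape != current_feats.shape:
        raise ValueError("feature grids must have the same shape")
    sim = np.sum(l2_normalize(goal_feats) * l2_normalize(current_feats), axis=-1)
    return sim, sim < threshold


def match_objects(goal_emb: np.ndarray,
                  current_emb: np.ndarray,
                  min_sim: float = 0.5):
    """Match objects by embedding similarity alone, without class labels.

    goal_emb is (N, D) and current_emb is (M, D). A pair is kept only if the
    two objects are each other's nearest neighbour and similar enough.
    Returns a list of (goal_index, current_index) pairs.
    """
    sim = l2_normalize(goal_emb) @ l2_normalize(current_emb).T  # (N, M)
    best_current = sim.argmax(axis=1)  # nearest current object per goal object
    best_goal = sim.argmax(axis=0)     # nearest goal object per current object
    return [
        (int(g), int(c))
        for g, c in enumerate(best_current)
        if best_goal[c] == g and sim[g, c] >= min_sim
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    goal = rng.normal(size=(4, 4, 16))   # stand-in for goal-view patch features
    current = goal.copy()
    current[1, 2] = -goal[1, 2]          # simulate a single changed patch
    _, changed = patch_change_mask(goal, current)
    print("changed patches:", np.argwhere(changed).tolist())
```

In this sketch the goal-view and current-view grids are assumed to come from the same camera pose, which is the role the 3D Gaussian Splatting renders play in the summary above. Without aligned views, a position-by-position comparison would not be meaningful.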