Query-based Semantic Gaussian Field for Scene Representation in Reinforcement Learning

Jiaxu Wang, Ziyi Zhang, Qiang Zhang, Jia Li, Jingkai Sun, Mingyuan Sun, Junhao He, Renjing Xu
2024-09-27
Abstract: Latent scene representation plays a significant role in training reinforcement learning (RL) agents. To obtain good latent vectors describing the scenes, recent works incorporate the 3D-aware latent-conditioned NeRF pipeline into scene representation learning. However, these NeRF-related methods struggle to perceive 3D structural information due to the inefficient dense sampling in volumetric rendering. Moreover, their scene representation vectors lack fine-grained semantic information because they treat free and occupied space equally. Both limitations can degrade the performance of downstream RL tasks. To address these challenges, we propose a novel framework that, for the first time, adopts efficient 3D Gaussian Splatting (3DGS) to learn 3D scene representations. In brief, we present the Query-based Generalizable 3DGS to bridge the 3DGS technique and scene representations, with more geometric awareness than NeRF-based representations. Moreover, we present the Hierarchical Semantics Encoding to ground fine-grained semantic features to 3D Gaussians and further distill them into the scene representation vectors. We conduct extensive experiments on two RL platforms, Maniskill2 and Robomimic, across 10 different tasks. The results show that our method outperforms the other 5 baselines by a large margin. We achieve the best success rates on 8 tasks and the second-best on the other two tasks.
Robotics
What problem does this paper attempt to address?
The paper attempts to address the problem of how to efficiently represent 3D scenes in vision-based Reinforcement Learning (RL) and extract compact scene representations with geometric and semantic information. Specifically:

1. **Limitations of existing methods**: Current NeRF-based methods are inefficient in handling 3D structural information and struggle to effectively utilize 3D geometric priors from RGB-D observations. These methods typically require dense sampling to render 3D scenes, resulting in low data efficiency and slow training speeds.
2. **Proposed new framework**: The paper proposes a new framework that leverages efficient 3D Gaussian Splatting (3DGS) technology to learn scene representations with geometric awareness. By introducing Hierarchical Semantics Encoding (HSE), the semantic details of the scene representation are further enhanced.
3. **Main contributions**:
   - For the first time, the 3DGS framework is used to learn scene representations with semantic and geometric awareness for vision-based reinforcement learning tasks.
   - A hierarchical semantics encoding scheme is proposed to guide scene representation learning based on Gaussian language fields.
   - A Query-based Generalizable Feature Splatting (QGFS) is proposed, which can render scenes from a single latent vector, thereby utilizing efficient 3DGS for representation learning.
4. **Experimental results**: Extensive experiments were conducted on two reinforcement learning platforms, including Maniskill2 and Robomimic, covering 10 different tasks. The experimental results show that the proposed method significantly outperforms five other baseline methods in most tasks, achieving the best success rate in 8 tasks and second place in the other 2 tasks.

In summary, the paper aims to improve scene representation methods to enhance the performance of vision-based reinforcement learning algorithms in understanding and operating in real-world 3D environments.