Abstract: Latent scene representation plays a significant role in training reinforcement learning (RL) agents. To obtain good latent vectors describing the scenes, recent works incorporate the 3D-aware latent-conditioned NeRF pipeline into scene representation learning. However, these NeRF-related methods struggle to perceive 3D structural information due to the inefficient dense sampling in volumetric rendering. Moreover, their scene representation vectors lack fine-grained semantic information because they treat free and occupied space equally. Both issues can degrade the performance of downstream RL tasks. To address these challenges, we propose a novel framework that, for the first time, adopts the efficient 3D Gaussian Splatting (3DGS) to learn 3D scene representations. In brief, we present the Query-based Generalizable 3DGS to bridge the 3DGS technique and scene representations with more geometrical awareness than those in NeRFs. Moreover, we present the Hierarchical Semantics Encoding to ground fine-grained semantic features to 3D Gaussians and further distill them into the scene representation vectors. We conduct extensive experiments on two RL platforms, Maniskill2 and Robomimic, across 10 different tasks. The results show that our method outperforms the other 5 baselines by a large margin. We achieve the best success rates on 8 tasks and the second-best on the other two tasks.

What problem does this paper attempt to address?

The paper addresses the problem of how to efficiently represent 3D scenes in vision-based reinforcement learning (RL) and extract compact scene representations that carry both geometric and semantic information. Specifically:

1. **Limitations of existing methods**: Current NeRF-based methods handle 3D structural information inefficiently and struggle to exploit the 3D geometric priors available in RGB-D observations. They typically require dense sampling to render 3D scenes, which leads to low data efficiency and slow training.
2. **Proposed new framework**: The paper proposes a framework that uses efficient 3D Gaussian Splatting (3DGS) to learn geometry-aware scene representations. Hierarchical Semantics Encoding (HSE) further enriches the semantic detail of the scene representation.
3. **Main contributions**:
   - For the first time, the 3DGS framework is used to learn scene representations with semantic and geometric awareness for vision-based reinforcement learning tasks.
   - A hierarchical semantics encoding scheme is proposed to guide scene representation learning based on Gaussian language fields.
   - A Query-based Generalizable Feature Splatting (QGFS) is proposed, which can render scenes from a single latent vector, thereby making efficient 3DGS usable for representation learning.
4. **Experimental results**: Extensive experiments were conducted on two reinforcement learning platforms, Maniskill2 and Robomimic, covering 10 different tasks. The proposed method significantly outperforms five other baseline methods on most tasks, achieving the best success rate on 8 tasks and the second-best on the other 2.

In summary, the paper aims to improve scene representation methods so that vision-based reinforcement learning algorithms can better understand and operate in real-world 3D environments.
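To make the idea of rendering from a single latent vector more concrete, the following is a minimal sketch of how a set of learnable queries could decode one scene latent into per-Gaussian parameters. It is not the paper's architecture: the function name `decode_gaussians`, the concatenation-based conditioning, the two-layer MLP, and all dimensions are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np


def decode_gaussians(latent, queries, w_hidden, w_out, feat_dim):
    """Decode one scene latent into per-Gaussian parameters (hypothetical sketch).

    latent:   (latent_dim,) scene representation vector
    queries:  (n, query_dim) learnable queries, one per Gaussian
    w_hidden: (query_dim + latent_dim, hidden_dim) weights
    w_out:    (hidden_dim, 11 + feat_dim) weights
    """
    n = queries.shape[0]
    # Condition every query on the shared latent by concatenation (an assumption).
    x = np.concatenate([queries, np.tile(latent, (n, 1))], axis=1)
    h = np.tanh(x @ w_hidden)
    raw = h @ w_out

    mean = raw[:, 0:3]                       # 3D centre of each Gaussian
    scale = np.exp(raw[:, 3:6])              # exp keeps the scales positive
    quat = raw[:, 6:10]
    norm = np.maximum(np.linalg.norm(quat, axis=1, keepdims=True), 1e-8)
    rotation = quat / norm                   # unit quaternion per Gaussian
    opacity = 1.0 / (1.0 + np.exp(-raw[:, 10:11]))  # sigmoid keeps it in (0, 1)
    feature = raw[:, 11:11 + feat_dim]       # per-Gaussian semantic feature
    return {
        "mean": mean,
        "scale": scale,
        "rotation": rotation,
        "opacity": opacity,
        "feature": feature,
    }


# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
latent_dim, query_dim, hidden_dim, n_gaussians, feat_dim = 32, 16, 64, 64, 8
latent = rng.normal(size=latent_dim)
queries = rng.normal(size=(n_gaussians, query_dim))
w_hidden = 0.1 * rng.normal(size=(query_dim + latent_dim, hidden_dim))
w_out = 0.1 * rng.normal(size=(hidden_dim, 11 + feat_dim))

gaussians = decode_gaussians(latent, queries, w_hidden, w_out, feat_dim)
print({name: value.shape for name, value in gaussians.items()})
```

In a full pipeline, the decoded Gaussians would be passed to a splatting rasterizer and the whole chain trained end to end, so that the latent vector is forced to encode the scene's geometry and semantics. The sketch stops before rendering because the rasterizer and losses are not described in this summary.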

Query-based Semantic Gaussian Field for Scene Representation in Reinforcement Learning

Reinforcement Learning with Generalizable Gaussian Splatting

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

Unbounded-GS: Extending 3D Gaussian Splatting with Hybrid Representation for Unbounded Large-Scale Scene Reconstruction

Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding

GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding

SLGaussian: Fast Language Gaussian Splatting in Sparse Views

SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning

PyGS: Large-scale Scene Representation with Pyramidal 3D Gaussian Splatting

GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding

FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding

3D Vision-Language Gaussian Splatting

VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction

A Refined 3D Gaussian Representation for High-Quality Dynamic Scene Reconstruction

GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields

Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene Reconstruction

S^3Gaussian: Self-Supervised Street Gaussians for Autonomous Driving
