$\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving

Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, Shanghang Zhang
2024-05-31
Abstract: Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian ($\textit{S}^3$Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our $\textit{S}^3$Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations. Code is available at: https://github.com/nnanhuang/S3Gaussian/
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper proposes S3Gaussian (Self-Supervised Street Gaussians), a method for reconstructing 3D street scenes for autonomous driving. Neural Radiance Fields (NeRF) are effective for driving scenes, but 3D Gaussian Splatting (3DGS) is faster and offers a more explicit representation. The remaining obstacle is that most existing street-scene 3DGS methods require tracked 3D vehicle bounding boxes to decompose static and dynamic elements, which limits their use in in-the-wild scenarios where such annotations are unavailable.

S3Gaussian takes a self-supervised approach: it decomposes dynamic and static elements from 4D consistency, without costly annotations. Each scene is represented with 3D Gaussians to keep the representation explicit, and a spatial-temporal field network compactly models the 4D dynamics. This network consists of a multi-resolution HexPlane structure encoder and a multi-head Gaussian decoder, which together handle complex spatiotemporal deformations and separate static from dynamic parts of the scene.

The main contributions of the paper are:

1. The first self-supervised method, S3Gaussian, that decomposes dynamic and static 3D Gaussians in street scenes without additional manual annotation.
2. An efficient spatial-temporal decomposition network that automatically captures the deformations of 3D Gaussians.
3. Extensive experiments on the challenging Waymo-Open dataset, showing that S3Gaussian achieves the best performance in scene reconstruction and novel view synthesis without relying on 3D annotations.

In this way, S3Gaussian enables high-fidelity, real-time neural rendering of dynamic urban street scenes for autonomous driving simulation without 3D supervision. It addresses the limitations of existing methods in training time, rendering speed, and the ability to separate dynamic from static scene content.
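To make the encoder-decoder idea described above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (their code is at the repository linked in the abstract), and several details are assumptions made for illustration: features from the six planes are fused by element-wise product and the resolution levels are concatenated, as is common in HexPlane-style grids; each decoder head is a single linear layer standing in for a small MLP; and the class names, head names (`delta_position`, `delta_color`), resolutions, and channel counts are all hypothetical.

```python
import random

# Axis pairs for the six planes, as indices into a point (x, y, z, t):
# three purely spatial planes and three space-time planes.
PLANE_AXES = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]


def make_plane(res, channels, rng):
    """Create a res x res grid of feature vectors with random values."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(channels)]
             for _ in range(res)] for _ in range(res)]


def bilinear(plane, u, v):
    """Bilinearly interpolate a feature vector at (u, v), both in [0, 1]."""
    res = len(plane)
    fu, fv = u * (res - 1), v * (res - 1)
    i0, j0 = min(int(fu), res - 2), min(int(fv), res - 2)
    du, dv = fu - i0, fv - j0
    return [
        (1 - du) * (1 - dv) * plane[i0][j0][c]
        + du * (1 - dv) * plane[i0 + 1][j0][c]
        + (1 - du) * dv * plane[i0][j0 + 1][c]
        + du * dv * plane[i0 + 1][j0 + 1][c]
        for c in range(len(plane[0][0]))
    ]


class HexPlaneEncoder:
    """Multi-resolution set of six 2D feature planes (hypothetical sketch)."""

    def __init__(self, resolutions=(4, 8), channels=3, seed=0):
        rng = random.Random(seed)
        self.channels = channels
        self.levels = [[make_plane(r, channels, rng) for _ in PLANE_AXES]
                       for r in resolutions]

    def encode(self, xyzt):
        """Map a normalized (x, y, z, t) point to a feature vector."""
        feature = []
        for planes in self.levels:
            level_feat = [1.0] * self.channels
            for plane, (a, b) in zip(planes, PLANE_AXES):
                sample = bilinear(plane, xyzt[a], xyzt[b])
                # Assumption: fuse the six planes by element-wise product.
                level_feat = [f * s for f, s in zip(level_feat, sample)]
            # Assumption: concatenate features across resolution levels.
            feature.extend(level_feat)
        return feature


class MultiHeadDecoder:
    """One linear head per predicted Gaussian attribute (hypothetical sketch)."""

    def __init__(self, in_dim, heads, seed=1):
        rng = random.Random(seed)
        self.heads = {
            name: [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(out_dim)]
            for name, out_dim in heads.items()
        }

    def decode(self, feature):
        """Return a dict mapping each head name to its output vector."""
        return {
            name: [sum(w * f for w, f in zip(row, feature)) for row in weights]
            for name, weights in self.heads.items()
        }


encoder = HexPlaneEncoder(resolutions=(4, 8), channels=3)
decoder = MultiHeadDecoder(in_dim=2 * 3,
                           heads={"delta_position": 3, "delta_color": 3})
feature = encoder.encode((0.25, 0.5, 0.75, 0.1))  # Gaussian center plus time
offsets = decoder.decode(feature)
```

In this sketch a Gaussian whose decoded offsets stay near zero over time would behave as static, while one with time-varying offsets would behave as dynamic. That is one plausible reading of how a shared spatial-temporal field can separate the two without bounding-box labels, not a statement of the paper's exact criterion.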