Abstract: Multi-camera 3D perception has emerged as a prominent research field in autonomous driving, offering a viable and cost-effective alternative to LiDAR-based solutions. Existing multi-camera algorithms primarily rely on monocular 2D pre-training, which overlooks the spatial and temporal correlations among the cameras of a multi-camera system. To address this limitation, we propose UniScene, the first multi-camera unified pre-training framework, which first reconstructs the 3D scene as a foundational stage and then fine-tunes the model on downstream tasks. Specifically, we employ occupancy as the general representation of the 3D scene, enabling the model to learn geometric priors of the surrounding world through pre-training. A significant benefit of UniScene is that it can use large volumes of unlabeled image-LiDAR pairs for pre-training. The framework demonstrates promising results on key tasks such as multi-camera 3D object detection and surrounding semantic scene completion. Compared with monocular pre-training methods on the nuScenes dataset, UniScene improves mAP by about 2.0% and NDS by about 2.0% for multi-camera 3D object detection, and mIoU by 3% for surrounding semantic scene completion. Adopting our unified pre-training method can reduce 3D training annotation costs by 25%, offering significant practical value for real-world autonomous driving. Code is publicly available at https://github.com/chaytonmin/UniScene.
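To make the idea of occupancy as a label-free pre-training target concrete, the following is a minimal sketch of how a raw LiDAR sweep could be voxelized into a binary occupancy grid. It is an illustration under stated assumptions, not the released UniScene code: the function name `lidar_to_occupancy`, the point-cloud range, and the voxel size are all hypothetical choices made for the example.

```python
import numpy as np

def lidar_to_occupancy(points,
                       pc_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0),
                       voxel_size=(0.8, 0.8, 0.8)):
    """Voxelize an (N, 3) LiDAR point cloud into a binary occupancy grid.

    pc_range and voxel_size are assumed values for illustration only;
    they are not taken from the paper.
    """
    points = np.asarray(points, dtype=np.float64)
    lo = np.array(pc_range[:3])
    hi = np.array(pc_range[3:])
    size = np.array(voxel_size)

    # Number of voxels along x, y, z.
    grid_shape = np.round((hi - lo) / size).astype(int)

    # Keep only points that fall inside the region of interest.
    inside = np.all((points >= lo) & (points < hi), axis=1)

    # Map each remaining point to its integer voxel index.
    idx = np.floor((points[inside] - lo) / size).astype(int)
    idx = np.minimum(idx, grid_shape - 1)  # guard against float rounding

    # A voxel is occupied if at least one LiDAR point lands in it.
    occ = np.zeros(grid_shape, dtype=bool)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ
```

A grid like this needs no human annotation, which is why unlabeled image-LiDAR pairs are enough for the reconstruction stage; a camera-only network would then be trained to predict the grid from the multi-camera images.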

Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Learning 3D Scene Priors with 2D Supervision

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Deep Optimized Priors for 3D Shape Modeling and Reconstruction

Learning Shape Priors for Single-View 3D Completion and Reconstruction

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction

Zero-Shot Multi-Object Scene Completion

To Boost Zero-Shot Generalization for Embodied Reasoning With Vision-Language Pre-Training

Enhancing Zero-shot 3D Photography Via Mesh-represented Image Inpainting

Deep Zero-Shot Learning for Scene Sketch

Zero-Shot Scene Classification for High Spatial Resolution Remote Sensing Images

Semi-supervised Single-view 3D Reconstruction via Multi Shape Prior Fusion Strategy and Self-Attention

ZeroShape: Regression-based Zero-shot Shape Reconstruction

FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models

3D Surface Reconstruction in the Wild by Deforming Shape Priors from Synthetic Data

Delving into Shape-aware Zero-shot Semantic Segmentation

Robust 3D Shape Reconstruction in Zero-Shot from a Single Image in the Wild

Incremental Joint Learning of Depth, Pose and Implicit Scene Representation on Monocular Camera in Large-scale Scenes

UniScene: Multi-Camera Unified Pre-training via 3D Scene Reconstruction for Autonomous Driving