X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios

Yichen Xie, Chenfeng Xu, Chensheng Peng, Shuqi Zhao, Nhat Ho, Alexander T. Pham, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan

2024-11-02

Abstract: Recent advancements have exploited diffusion models for the synthesis of either LiDAR point clouds or camera image data in driving scenarios. Despite their success in modeling single-modality data marginal distribution, there is an under-exploration in the mutual reliance between different modalities to describe complex driving scenes. To fill in this gap, we propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images via a dual-branch latent diffusion model architecture. Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions from the other modality, ensuring better alignment and realism. To further handle the spatial ambiguity during denoising, we design the cross-modality condition module based on epipolar lines to adaptively learn the cross-modality local correspondence. Besides, X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding box, image, and point clouds. Extensive results demonstrate the high-fidelity synthetic results of X-DRIVE for both point clouds and multi-view images, adhering to input conditions while ensuring reliable cross-modality consistency. Our code will be made publicly available at https://github.com/yichen928/X-Drive.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the generation of high-quality, multi-modal sensor data (LiDAR point clouds and multi-view images) for autonomous driving scenarios. Existing generative models have made significant progress on single-modality data, but generating multi-modal data raises several challenges:

1. **Spatial Alignment**: The synthesized point clouds and multi-view images must stay spatially aligned in all local regions, because they describe the same driving scene. The shapes and layouts of the foreground and background must match.
2. **Geometric Space Differences**: Point clouds and multi-view images have different geometric spaces and data formats. Multi-view images are represented by RGB values from the camera's perspective, while point clouds are represented by XYZ coordinates in 3D space.
3. **Positional Ambiguity**: During the generation process, the spatial information of point clouds and multi-view images is ambiguous, because reliable point positions and pixel depths are not available.

To overcome these challenges, the paper proposes a new framework called X-DRIVE, which uses a dual-branch latent diffusion model architecture to jointly generate point clouds and multi-view images, and introduces a cross-modal conditioning module to enhance consistency between modalities. X-DRIVE not only generates high-quality multi-modal data but also ensures consistency between the modalities, thereby improving the performance of downstream tasks.
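The positional ambiguity described above can be made concrete with a small geometric sketch. It is an illustration under stated assumptions, not the X-DRIVE implementation: it assumes an ideal pinhole camera, hypothetical intrinsics (`FX`, `FY`, `CX`, `CY`), and a hypothetical second sensor whose axes are aligned with the camera and whose origin is offset by `LIDAR_ORIGIN`. A pixel without depth corresponds to a whole ray of candidate 3D points. All of them project back to the same pixel, yet they are seen under different angles from the second sensor, which is why a correspondence has to be searched along a line rather than read off at a single location.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a given depth into the camera frame
    (x right, y down, z forward) using a pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def project(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def candidates_along_ray(u, v, depths, fx, fy, cx, cy):
    """All 3D points a pixel could correspond to, one per candidate depth."""
    return [unproject(u, v, d, fx, fy, cx, cy) for d in depths]

def to_sensor_angles(point, sensor_origin):
    """Azimuth and elevation (radians) of a camera-frame point as seen from
    a second sensor located at sensor_origin (axes assumed aligned)."""
    dx = point[0] - sensor_origin[0]
    dy = point[1] - sensor_origin[1]
    dz = point[2] - sensor_origin[2]
    azimuth = math.atan2(dx, dz)
    elevation = math.atan2(-dy, math.hypot(dx, dz))
    return azimuth, elevation

# Hypothetical values chosen only for illustration.
FX, FY, CX, CY = 1000.0, 1000.0, 800.0, 450.0
LIDAR_ORIGIN = (0.0, -0.5, -1.0)   # assumed offset of the second sensor
PIXEL = (1000.0, 450.0)
DEPTHS = [2.0, 5.0, 10.0, 40.0]    # unknown at generation time, so sampled

points = candidates_along_ray(PIXEL[0], PIXEL[1], DEPTHS, FX, FY, CX, CY)
reprojected = [project(p, FX, FY, CX, CY) for p in points]
angles = [to_sensor_angles(p, LIDAR_ORIGIN) for p in points]

for depth, (az, el) in zip(DEPTHS, angles):
    print(f"depth {depth:5.1f} m -> azimuth {az:.4f} rad, elevation {el:.4f} rad")
```

Each sampled depth yields a different azimuth and elevation from the second sensor, so the single pixel traces out a curve of candidate locations in that sensor's view. A learned module can then weight those candidates, which is the role the abstract assigns to the epipolar-line-based condition module.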

DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model

DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving

XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs

Driving Scene Synthesis on Free-form Trajectories with Generative Prior

SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes

CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

DeepInteraction++: Multi-Modality Interaction for Autonomous Driving

MagicDrive: Street View Generation with Diverse 3D Geometry Control

GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model

A LiDAR Point Cloud Generator: from a Virtual World to Autonomous Driving

DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

Real-to-Virtual Domain Unification for End-to-End Autonomous Driving

CrossFuser: Multi-Modal Feature Fusion for End-to-End Autonomous Driving Under Unseen Weather Conditions