Abstract: Holistically understanding an object and its 3D movable parts through visual perception models is essential for enabling an autonomous agent to interact with the world. In autonomous driving, the dynamics and states of vehicle parts such as doors, the trunk, and the bonnet provide meaningful semantic information and interaction states that are essential to the safety of the self-driving vehicle. Existing visual perception models mainly focus on coarse parsing, such as object bounding box detection or pose estimation, and rarely tackle these situations. In this paper, we address this important autonomous driving problem by solving three critical issues. First, to deal with data scarcity, we propose an effective training data generation process that fits a 3D car model with dynamic parts to vehicles in real images and then reconstructs vehicle-human interaction (VHI) scenarios. Our approach is fully automatic, requires no human intervention, and can generate a large number of vehicles in uncommon states (VUS) for training deep neural networks (DNNs). Second, to perform fine-grained vehicle perception, we present a multi-task network for VUS parsing and a multi-stream network for VHI parsing. Third, to quantitatively evaluate the effectiveness of our data augmentation approach, we build the first VUS dataset in real traffic scenarios (e.g., getting in or out of a vehicle, or placing or removing luggage). Experimental results show that our approach outperforms baseline methods in 2D detection and instance segmentation by a large margin (over 8%). In addition, our network yields large improvements in discovering and understanding these uncommon cases. We have released the source code, the dataset, and the trained model on GitHub (https://github.com/zongdai/EditingForDNN).

DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes

DragScene: Interactive 3D Scene Editing with Single-view Drag Instructions

DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model

DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

3D Part Guided Image Editing for Fine-Grained Object Understanding

Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents

CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion

X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios

DiVE: DiT-based Video Generation with Enhanced Control

ProEdit: Simple Progression is All You Need for High-Quality 3D Scene Editing

Edit3D: Elevating 3D Scene Editing with Attention-Driven Multi-Turn Interactivity

Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles

DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

Text-driven Editing of 3D Scenes without Retraining

3DSceneEditor: Controllable 3D Scene Editing with Gaussian Splatting

Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

Object-Centric Diffusion for Efficient Video Editing

MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

PAIR-Diffusion: A Comprehensive Multimodal Object-Level Image Editor

Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation