Abstract: Depth estimation, visual odometry (VO), and bird's-eye-view (BEV) scene layout estimation are three critical tasks for driving scene perception, which is fundamental to motion planning and navigation in autonomous driving. Although the tasks are complementary, prior works usually focus on one of them and rarely address all three together. A naive approach is to perform them independently, in sequence or in parallel, but this has drawbacks, namely: 1) the depth and VO results suffer from inherent scale ambiguity; 2) the BEV layout is predicted directly from the front-view image without any depth-related information, even though the depth map contains useful geometric cues for inferring scene layout. In this paper, we address these issues by proposing a novel joint perception framework named JPerceiver, which simultaneously estimates scale-aware depth and VO as well as BEV layout from a monocular video sequence. It exploits a cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO through a carefully designed scale loss. Meanwhile, a cross-view and cross-modal transfer (CCT) module leverages depth cues for reasoning about road and vehicle layout through an attention mechanism. JPerceiver can be trained end to end in a multi-task learning manner, where the CGT scale loss and the CCT module promote inter-task knowledge transfer that benefits feature learning for each task. Experiments on Argoverse, nuScenes, and KITTI show the superiority of JPerceiver over existing methods on all three tasks in terms of accuracy, model size, and inference speed.
The code and models are available at https://github.com/sunnyHelen/JPerceiver.

AutoLay: Benchmarking amodal layout estimation for autonomous driving

MonoLayout: Amodal scene layout from a single image

Amodal Layout Completion in Complex Outdoor Scenes

AmodalSynthDrive: A Synthetic Amodal Perception Dataset for Autonomous Driving

ONCE-3DLanes: Building Monocular 3D Lane Detection

3D Vehicle Detection Using Cheap LiDAR and Camera Sensors

Learning from Maps: Visual Common Sense for Autonomous Driving

The Earth ain't Flat: Monocular Reconstruction of Vehicles on Steep and Graded Roads from a Moving Camera

Understanding Bird's-Eye View of Road Semantics using an Onboard Camera

Occlusion-Aware 2D and 3D Centerline Detection for Urban Driving via Automatic Label Generation

BLVD: Building A Large-scale 5D Semantics Benchmark for Autonomous Driving

Monocular Multi-Layer Layout Estimation for Warehouse Racks

Lane Detection and Tracking Datasets: Efficient Investigation and New Measurement by a Novel "Dataset Scenario Detector" Application

IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments

KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D

A Dataset for Lane Instance Segmentation in Urban Environments

Drive&Act: A Multi-Modal Dataset for Fine-Grained Driver Behavior Recognition in Autonomous Vehicles

Proximity based automatic data annotation for autonomous driving

JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes

A new representation of scene layout improves saliency detection in traffic scenes

XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis