Abstract: Multi-view 3D object detection (MV3D) has made tremendous progress by leveraging perspective features from multiple surrounding cameras. Despite its promise in various applications, accurately detecting objects in 3D space from camera views remains extremely difficult because monocular depth estimation is ill-posed. Recently, Graph-DETR3D introduced a novel graph-based 3D-2D query paradigm for aggregating multi-view images for 3D object detection and achieved competitive performance. Although it enriches the query representations with 2D image features through a learnable 3D graph, its depth and velocity estimation remain limited because it uses a single-frame input setting. To solve this problem, we introduce a unified spatial-temporal graph modeling framework that fully leverages multi-view imagery cues under a multi-frame input setting. Thanks to the flexibility and sparsity of the dynamic graph architecture, we lift the original 3D graph into 4D space with an effective attention mechanism that automatically perceives imagery information at both the spatial and temporal levels. Moreover, since the main latency bottleneck lies in the image backbone, we propose a novel dense-sparse distillation framework for multi-view 3D object detection that reduces the computational budget without sacrificing detection accuracy, making it more suitable for real-world deployment. To this end, we propose Graph-DETR4D, a faster and stronger multi-view 3D object detection framework built on top of Graph-DETR3D. Extensive experiments on the nuScenes and Waymo benchmarks demonstrate the effectiveness and efficiency of Graph-DETR4D. Notably, our best model achieves 62.0% NDS on the nuScenes test leaderboard. Code is available at https://github.com/zehuichen123/Graph-DETR4D.
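
For readers unfamiliar with the NDS figure quoted in the abstract, the sketch below shows how the nuScenes Detection Score is commonly computed: mAP is weighted five times and combined with the five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE), each clipped to 1 and converted to a score. The function name and the input values are illustrative assumptions; they are not taken from the Graph-DETR4D code base or its reported results.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """Return NDS = (5 * mAP + sum(1 - min(1, err))) / 10.

    mean_ap   -- mean average precision in [0, 1]
    tp_errors -- dict holding the five mean true-positive errors
                 (mATE, mASE, mAOE, mAVE, mAAE); lower is better
    """
    expected = {"mATE", "mASE", "mAOE", "mAVE", "mAAE"}
    if set(tp_errors) != expected:
        raise ValueError(f"expected exactly these keys: {sorted(expected)}")
    # Each error is clipped to 1 so that a very poor metric contributes 0
    # rather than a negative score.
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Illustrative values only (hypothetical, not the paper's numbers).
example_errors = {
    "mATE": 0.6,
    "mASE": 0.25,
    "mAOE": 1.5,   # clipped to 1, so it contributes a score of 0
    "mAVE": 0.5,
    "mAAE": 0.25,
}
example_nds = nuscenes_detection_score(0.4, example_errors)
```

Because half of the score comes from the error terms, better velocity and translation estimates (the quantities the abstract ties to multi-frame input) can raise NDS even when mAP is unchanged.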

PETR: Position Embedding Transformation for Multi-View 3D Object Detection

PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images

Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

Position-Guided Point Cloud Panoptic Segmentation Transformer

OVPT: Optimal Viewset Pooling Transformer for 3D Object Recognition

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

PetalView: Fine-grained Location and Orientation Extraction of Street-view Images via Cross-view Local Search with Supplementary Materials

MUTR3D: A Multi-camera Tracking Framework via 3D-to-2D Queries

Graph-DETR4D: Spatio-Temporal Graph Modeling for Multi-View 3D Object Detection

EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

Geometric-aware Pretraining for Vision-centric 3D Object Detection