Abstract: 3D object detection from multi-view images has drawn much attention over the past few years. Existing methods mainly establish 3D representations from multi-view images and adopt a dense detection head for object detection, or employ object queries distributed in 3D space to localize objects. In this paper, we design Multi-View 2D Objects guided 3D Object Detector (MV2D), which can lift any 2D object detector to multi-view 3D object detection. Since 2D detections can provide valuable priors for object existence, MV2D exploits 2D detectors to generate object queries conditioned on the rich image semantics. These dynamically generated queries help MV2D to recall objects in the field of view and show a strong capability of localizing 3D objects. For the generated queries, we design a sparse cross attention module to force them to focus on the features of specific objects, which suppresses interference from noises. The evaluation results on the nuScenes dataset demonstrate the dynamic object queries and sparse feature aggregation can promote 3D detection capability. MV2D also exhibits a state-of-the-art performance among existing methods. We hope MV2D can serve as a new baseline for future research. Code is available at https://github.com/tusen-ai/MV2D.

What problem does this paper attempt to address?

### The Problem Addressed by This Paper

The main goal of this paper is to improve the performance of 3D object detection in multi-view images. Specifically, the paper proposes a new framework called **Multi-View 2D Objects guided 3D Object Detector (MV2D)**. The core idea is to generate dynamic object queries from 2D detection results to enhance the effectiveness of 3D object detection.

#### Main Issues and Solution Overview:

1. **Problems with Existing 3D Detection Methods**:
   - Single-view 3D detection methods cannot fully utilize the geometric configuration of surrounding cameras and the correspondence between multi-view images.
   - Existing multi-view methods require complex cross-camera post-processing steps, leading to reduced efficiency.
   - Dense 3D methods (such as bird's-eye view) can unify multi-view images, but the computational cost increases with the detection range.
   - Methods based on fixed queries may produce false positives or miss detections in dynamic scenes.
2. **Proposed Solution**:
   - Utilize efficient 2D object detectors to generate high-quality 2D bounding boxes, from which dynamic object queries are generated.
   - Propose a sparse cross-attention module to suppress noise interference, allowing the generated queries to focus on the features of specific objects.
   - Improve the accuracy and recall rate of 3D object detection through dynamically generated queries and feature extraction from relevant regions.
3. **Experimental Validation**:
   - Evaluate on the standard nuScenes dataset, demonstrating that MV2D can significantly improve 3D detection performance and achieve state-of-the-art levels.

In summary, this paper aims to leverage the advantages of 2D detection to enhance multi-view 3D object detection through dynamically generated object queries, particularly improving performance on small and distant objects.
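The two ideas summarized above, queries generated from 2D detections and attention restricted to each object's own features, can be illustrated with a small NumPy sketch. This is not the MV2D implementation: the function names (`box_mask`, `queries_from_boxes`, `sparse_cross_attention`), the mean pooling used to form a query, the single-view single-head attention, and the assumption that boxes are given in feature-map coordinates and are non-empty are all simplifications chosen for illustration.

```python
import numpy as np


def box_mask(box, height, width):
    """Boolean (height*width,) mask of feature cells inside an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    ys, xs = np.mgrid[0:height, 0:width]
    inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
    return inside.reshape(-1)


def queries_from_boxes(features, boxes):
    """Build one query per 2D box by averaging the features inside it.

    Mean pooling is a hypothetical stand-in for a learned query generator.
    Boxes are assumed to cover at least one feature cell.
    """
    height, width, channels = features.shape
    flat = features.reshape(-1, channels)                            # (H*W, C)
    masks = np.stack([box_mask(b, height, width) for b in boxes])    # (N, H*W)
    weights = masks.astype(float)
    queries = weights @ flat / weights.sum(axis=1, keepdims=True)    # (N, C)
    return queries, masks


def sparse_cross_attention(queries, features, masks):
    """Cross-attention in which each query only sees cells inside its own box."""
    height, width, channels = features.shape
    flat = features.reshape(-1, channels)
    scores = queries @ flat.T / np.sqrt(channels)                    # (N, H*W)
    scores = np.where(masks, scores, -np.inf)                        # hide other regions
    scores = scores - scores.max(axis=1, keepdims=True)              # numerical stability
    weights = np.exp(scores)                                         # exp(-inf) == 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ flat, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_map = rng.normal(size=(8, 8, 4))        # toy (H, W, C) image features
    detections = [(0, 0, 4, 4), (4, 4, 8, 8)]       # toy 2D boxes from a detector
    queries, masks = queries_from_boxes(feature_map, detections)
    updated, attention = sparse_cross_attention(queries, feature_map, masks)
    print(updated.shape, attention.shape)
```

Because the number of queries equals the number of 2D detections, the query set changes with each scene rather than being fixed in advance, and the mask keeps each query from aggregating background or neighbouring-object features.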

Object as Query: Lifting any 2D Object Detector to 3D Detection

Object as Query: Equipping Any 2D Object Detector with 3D Detection Ability

MT-SSD: Single-Stage 3D Object Detector Based on Magnification Transformation

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

Enhance the 3D Object Detection With 2D Prior

DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

Graph-DETR4D: Spatio-Temporal Graph Modeling for Multi-View 3D Object Detection

MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection

Multi-View Attentive Contextualization for Multi-View 3D Object Detection

MVMM: Multiview Multimodal 3-D Object Detection for Autonomous Driving

DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection

M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D Object Detection

Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection

Multi-view 3D Object Detection Network for Autonomous Driving

Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

MVM3Det: A Novel Method for Multi-view Monocular 3D Detection

OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection