Object as Query: Lifting any 2D Object Detector to 3D Detection

Zitian Wang, Zehao Huang, Jiahui Fu, Naiyan Wang, Si Liu
2023-11-06
Abstract: 3D object detection from multi-view images has drawn much attention over the past few years. Existing methods mainly establish 3D representations from multi-view images and adopt a dense detection head for object detection, or employ object queries distributed in 3D space to localize objects. In this paper, we design Multi-View 2D Objects guided 3D Object Detector (MV2D), which can lift any 2D object detector to multi-view 3D object detection. Since 2D detections can provide valuable priors for object existence, MV2D exploits 2D detectors to generate object queries conditioned on the rich image semantics. These dynamically generated queries help MV2D to recall objects in the field of view and show a strong capability of localizing 3D objects. For the generated queries, we design a sparse cross attention module to force them to focus on the features of specific objects, which suppresses interference from noise. The evaluation results on the nuScenes dataset demonstrate that the dynamic object queries and sparse feature aggregation can promote 3D detection capability. MV2D also exhibits state-of-the-art performance among existing methods. We hope MV2D can serve as a new baseline for future research. Code is available at https://github.com/tusen-ai/MV2D.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem Addressed by This Paper

The main goal of this paper is to improve the performance of 3D object detection in multi-view images. Specifically, the paper proposes a new framework called **Multi-View 2D Objects guided 3D Object Detector (MV2D)**. The core idea is to generate dynamic object queries from 2D detection results to enhance the effectiveness of 3D object detection.

#### Main Issues and Solution Overview:

1. **Problems with Existing 3D Detection Methods**:
   - Single-view 3D detection methods cannot fully utilize the geometric configuration of surrounding cameras and the correspondence between multi-view images.
   - Existing multi-view methods require complex cross-camera post-processing steps, leading to reduced efficiency.
   - Dense 3D methods (such as bird's-eye view) can unify multi-view images, but the computational cost increases with the detection range.
   - Methods based on fixed queries may produce false positives or miss detections in dynamic scenes.
2. **Proposed Solution**:
   - Utilize efficient 2D object detectors to generate high-quality 2D bounding boxes, from which dynamic object queries are generated.
   - Propose a sparse cross-attention module to suppress noise interference, allowing the generated queries to focus on the features of specific objects.
   - Improve the accuracy and recall rate of 3D object detection through dynamically generated queries and feature extraction from relevant regions.
3. **Experimental Validation**:
   - Evaluate on the standard nuScenes dataset, demonstrating that MV2D can significantly improve 3D detection performance and achieve state-of-the-art levels.

In summary, this paper aims to leverage the advantages of 2D detection to enhance multi-view 3D object detection through dynamically generated object queries, particularly improving performance on small and distant objects.
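
The two ideas in the proposed solution, queries generated from 2D boxes and attention restricted to object regions, can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single view, uses mean pooling inside each box as a stand-in for a learned query generator, and all function names (`box_mask`, `queries_from_boxes`, `sparse_cross_attention`) are hypothetical. It only shows how masking the attention scores keeps each query focused on the features of its own object.

```python
import numpy as np

def box_mask(box, height, width):
    """Boolean (H*W,) mask, True for feature-map cells inside a 2D box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    ys, xs = np.mgrid[0:height, 0:width]
    inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
    return inside.reshape(-1)

def queries_from_boxes(feat, boxes):
    """One query per 2D detection: mean of the features inside its box.

    Assumption: mean pooling replaces whatever learned query generator a real
    detector would use. Boxes are assumed to cover at least one cell.
    """
    h, w, c = feat.shape
    flat = feat.reshape(h * w, c)
    return np.stack([flat[box_mask(b, h, w)].mean(axis=0) for b in boxes])

def sparse_cross_attention(queries, feat, boxes):
    """Scaled dot-product attention where query i only sees features inside box i."""
    h, w, c = feat.shape
    flat = feat.reshape(h * w, c)                            # keys and values
    masks = np.stack([box_mask(b, h, w) for b in boxes])     # (N, H*W)
    scores = queries @ flat.T / np.sqrt(c)                   # (N, H*W)
    scores = np.where(masks, scores, -np.inf)                # hide cells outside the box
    scores = scores - scores.max(axis=1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ flat, weights

# Toy example: an 8x8 feature map with 4 channels and two 2D detections.
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 4))
boxes = [(0, 0, 3, 3), (4, 4, 8, 8)]
queries = queries_from_boxes(feat, boxes)
updated, weights = sparse_cross_attention(queries, feat, boxes)
```

Because the number of queries equals the number of 2D detections, the query set changes with the scene rather than being fixed in advance, and the mask ensures that features outside a box receive exactly zero attention weight for that box's query.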