Multi-View Attentive Contextualization for Multi-View 3D Object Detection

Xianpeng Liu, Ce Zheng, Ming Qian, Nan Xue, Chen Chen, Zhebin Zhang, Chen Li, Tianfu Wu
2024-05-21
Abstract: We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress in query-based MV3D object detection, prior art often suffers either from failing to exploit high-resolution 2D features in dense attention-based lifting, due to high computational costs, or from insufficiently dense grounding of 3D queries to multi-scale 2D features in sparse attention-based lifting. Our proposed MvACon kills two birds with one stone using a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to specific 2D-to-3D feature lifting approaches. In experiments, MvACon is thoroughly tested on the nuScenes benchmark, using both BEVFormer and its recent 3D deformable attention (DFA3D) variant, as well as PETR, and shows consistent detection performance improvement, especially in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer, with similar improvement. We qualitatively and quantitatively show that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of our proposed MvACon reinforce the adage in computer vision: "(contextualized) feature matters".
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Existing methods face two main challenges when mapping 2D features into 3D space:

1. **Insufficient use of high-resolution 2D features**: Dense attention-based lifting is too computationally expensive to apply to high-resolution 2D features, so those features go largely unexploited.
2. **Sparse grounding of 3D queries in multi-scale 2D features**: Sparse attention-based lifting does not ground 3D queries densely enough in multi-scale 2D features, which limits detection performance.

To address both problems, the authors propose **Multi-View Attentive Contextualization (MvACon)**, an attentive feature contextualization scheme that is dense in representation but sparse in computation. It improves 2D-to-3D feature lifting without incurring an excessive computational burden, and the gains are most pronounced in location, orientation, and velocity prediction.

### Main contributions of MvACon

- **Analyzes and addresses the limitations of existing methods**: In particular, their limited 3D representation ability, which leads to insufficient local 3D perception.
- **Proposes an easily integrable method**: MvACon can be applied to both decoder-based and encoder-decoder-based mainstream query-based MV3D object detection frameworks to enhance their 2D-to-3D feature lifting.
- **Verifies the performance improvement experimentally**: On the nuScenes benchmark and the Waymo-mini benchmark, MvACon consistently improves the detection performance of multiple baseline models, especially in location, orientation, and velocity prediction.
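To make "dense in representation but sparse in computation" more concrete, the following back-of-the-envelope sketch compares how many query-key scores dense self-attention needs with how many are needed when every feature point attends only to a small set of cluster contexts. The feature-map size and the number of clusters are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def pairwise_scores(num_queries: int, num_keys: int) -> int:
    """Number of query-key similarity scores one attention layer must compute."""
    return num_queries * num_keys


# Hypothetical sizes, NOT from the paper: one 116 x 200 feature map
# and 49 cluster contexts summarizing it.
height, width = 116, 200
num_tokens = height * width
num_clusters = 49

# Dense attention: every feature point attends to every other feature point.
dense = pairwise_scores(num_tokens, num_tokens)

# Cluster-based attention: every feature point attends to the clusters only.
clustered = pairwise_scores(num_tokens, num_clusters)

print(f"dense attention scores:     {dense:,}")
print(f"clustered attention scores: {clustered:,}")
print(f"reduction factor:           {dense / clustered:.1f}x")
```

Under these assumed sizes the score count grows with the number of clusters rather than with the number of feature points, which is why every point can still receive scene-level context at a cost that stays manageable for high-resolution maps.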
### Specific implementation

The core idea of MvACon is to contextualize 2D features with a cluster-based attention mechanism, so that each feature point is connected to the rest of the scene, thereby inducing global 3D perception. How the module is applied depends on the detector type:

1. **For decoder-based detectors (such as PETR)**: Each 2D feature map is processed by the MvACon module, allowing each feature point to connect to the entire feature map.
2. **For encoder-decoder-based detectors (such as BEVFormer)**: Multi-scale feature maps are processed by the MvACon module, allowing each feature point to connect to the entire L-layer feature pyramid.

In this way, MvACon improves 2D-to-3D feature lifting while remaining computationally efficient, which in turn improves overall multi-view 3D object detection performance.

### Experimental results

The experiments show that MvACon consistently improves detection performance on the nuScenes and Waymo-mini benchmarks, especially in location, orientation, and velocity prediction. This supports the effectiveness and generality of MvACon.
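The cluster-based contextualization described under "Specific implementation" can be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation: the function names (`pool_clusters`, `contextualize`) are hypothetical, average pooling over a fixed grid is only a stand-in for however the clusters are actually formed, and the learned query/key/value projections of a real attention layer are omitted. Passing a single feature map mimics the decoder-based case (PETR), while passing several maps mimics the encoder-decoder case (BEVFormer), where every feature point attends to clusters drawn from the whole feature pyramid.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_clusters(feat, grid):
    """Average-pool an (H, W, C) map into grid * grid cluster vectors.

    Assumption: fixed-grid pooling stands in for the real clustering step.
    Requires H >= grid and W >= grid so that no cell is empty.
    """
    height, width, channels = feat.shape
    row_chunks = np.array_split(np.arange(height), grid)
    col_chunks = np.array_split(np.arange(width), grid)
    clusters = [
        feat[r[0]:r[-1] + 1, c[0]:c[-1] + 1].reshape(-1, channels).mean(axis=0)
        for r in row_chunks
        for c in col_chunks
    ]
    return np.stack(clusters)  # (grid * grid, C)


def contextualize(feats, grid=4):
    """Add cluster-based context to every feature point.

    feats: list of (H_l, W_l, C) arrays; one map for a PETR-style detector,
    several pyramid levels for a BEVFormer-style detector.
    """
    channels = feats[0].shape[-1]
    # Clusters from all levels form one shared set of keys and values.
    clusters = np.concatenate([pool_clusters(f, grid) for f in feats], axis=0)
    outputs = []
    for f in feats:
        height, width, _ = f.shape
        queries = f.reshape(-1, channels)                   # (N, C)
        scores = queries @ clusters.T / np.sqrt(channels)   # (N, M)
        context = softmax(scores, axis=-1) @ clusters       # (N, C)
        # Residual connection keeps the original local feature.
        outputs.append((queries + context).reshape(height, width, channels))
    return outputs


rng = np.random.default_rng(0)
pyramid = [rng.standard_normal((16, 28, 8)), rng.standard_normal((8, 14, 8))]
contextualized = contextualize(pyramid, grid=4)
print([level.shape for level in contextualized])
```

Each output map keeps the shape of its input, so in this sketch the contextualized features could be handed to the existing lifting step unchanged, which is consistent with the claim that the scheme is agnostic to the specific 2D-to-3D lifting approach.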