Abstract: We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress in query-based MV3D object detection, prior art often suffers either from failing to exploit high-resolution 2D features in dense attention-based lifting, due to high computational costs, or from insufficiently dense grounding of 3D queries to multi-scale 2D features in sparse attention-based lifting. MvACon kills two birds with one stone using a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to the specific 2D-to-3D feature lifting approach. In experiments, MvACon is thoroughly tested on the nuScenes benchmark using BEVFormer, its recent 3D deformable attention (DFA3D) variant, and PETR, showing consistent detection performance improvements, especially in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer, with similar improvements. We show qualitatively and quantitatively that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of MvACon reinforce the adage in computer vision: "(contextualized) feature matters".


### What problem does this paper attempt to solve?

This paper addresses 2D-to-3D feature lifting in multi-view 3D (MV3D) object detection. Existing methods face two main challenges when mapping 2D features into 3D space:

1. **Insufficient use of high-resolution 2D features**: Dense attention-based lifting cannot fully exploit high-resolution 2D features because of its high computational cost.
2. **Sparse grounding of 3D queries in multi-scale 2D features**: Sparse attention-based lifting does not ground 3D queries densely enough in multi-scale 2D features, which limits performance.

To address both challenges, the authors propose **Multi-View Attentive Contextualization (MvACon)**, an attentive feature contextualization scheme that is dense in representation but sparse in computation. It improves 2D-to-3D feature lifting without excessive computational overhead, with the largest gains in location, orientation, and velocity prediction.

### Main contributions of MvACon

- **Analyzes and addresses the limitations of existing methods**: In particular, their lack of sufficient 3D representation ability, which leads to insufficient local 3D perception.
- **Proposes an easily integrable method**: MvACon can be applied to both decoder-based and encoder-decoder-based mainstream query-based MV3D object detection frameworks to strengthen their 2D-to-3D feature lifting.
- **Verifies the performance improvement experimentally**: On the nuScenes dataset and the Waymo-mini benchmark, MvACon improves the detection performance of multiple baseline models, especially in location, orientation, and velocity prediction.
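
As a rough illustration of "dense in representation but sparse in computation" (a back-of-the-envelope comparison, not taken from the paper), assume the contextualization step operates on $N$ 2D feature points with channel dimension $d$ and summarizes them into $M \ll N$ cluster contexts. The symbols $N$, $M$, and $d$ are introduced here for illustration only.

$$
\underbrace{\mathcal{O}(N^{2} d)}_{\text{full self-attention over all points}}
\quad \text{vs.} \quad
\underbrace{\mathcal{O}(N M d)}_{\text{attention from points to cluster contexts}},
\qquad M \ll N .
$$

Under these assumptions, each point still receives context aggregated from all $N$ points through the clusters (dense in representation), while the attention map has only $N \times M$ entries (sparse in computation).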
### Specific implementation

The core idea of MvACon is to contextualize 2D features with a cluster-based attention mechanism, so that each feature point is connected to context from the entire scene, thereby inducing global 3D perception. How the module is applied depends on the detector type:

1. **Decoder-based detectors (such as PETR)**: Each 2D feature map is processed by the MvACon module, so that each feature point is connected to the entire feature map.
2. **Encoder-decoder-based detectors (such as BEVFormer)**: The multi-scale feature maps are processed by the MvACon module, so that each feature point is connected to the entire L-level feature pyramid.

In this way, MvACon improves 2D-to-3D feature lifting while remaining computationally efficient, which in turn improves overall MV3D detection performance.

### Experimental results

The experiments show that MvACon consistently improves detection performance on nuScenes (with BEVFormer, its DFA3D variant, and PETR) and on Waymo-mini (with BEVFormer), especially in location, orientation, and velocity prediction. These results support the effectiveness and generality of MvACon.
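
The cluster-based contextualization described under "Specific implementation" can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the soft cluster assignment via a linear projection, the residual connection, and the names `flatten_pyramid`, `cluster_contextualize`, and `w_assign` are all hypothetical choices made here. The single-map case corresponds to the PETR-style usage, and flattening several levels first corresponds to the BEVFormer-style usage.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def flatten_pyramid(levels):
    """Flatten a list of (H, W, d) feature maps into one (N, d) token array.

    Passing a single-element list covers the single-feature-map case.
    """
    return np.concatenate([f.reshape(-1, f.shape[-1]) for f in levels], axis=0)


def cluster_contextualize(feats, w_assign, w_q, w_k, w_v):
    """Contextualize N feature points using M cluster contexts.

    feats:    (N, d) flattened 2D features
    w_assign: (d, M) projection producing cluster logits (assumed design)
    w_q, w_k, w_v: (d, d) query / key / value projections
    """
    # Soft assignment, normalized over the N points, so each cluster
    # is a weighted average of all feature points.
    assign = softmax(feats @ w_assign, axis=0)   # (N, M)
    clusters = assign.T @ feats                  # (M, d)

    q = feats @ w_q                              # (N, d)
    k = clusters @ w_k                           # (M, d)
    v = clusters @ w_v                           # (M, d)

    # Each point attends to only M clusters: an (N, M) attention map.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)
    return feats + attn @ v                      # residual (assumed)


rng = np.random.default_rng(0)
d, M = 16, 8
# Two pyramid levels as a stand-in for an L-level feature pyramid.
pyramid = [rng.normal(size=(16, 16, d)), rng.normal(size=(8, 8, d))]
tokens = flatten_pyramid(pyramid)
out = cluster_contextualize(
    tokens,
    rng.normal(size=(d, M)),
    rng.normal(size=(d, d)),
    rng.normal(size=(d, d)),
    rng.normal(size=(d, d)),
)
print(tokens.shape, out.shape)
```

Because the clusters are pooled from every token in the flattened input, each output feature carries context from the whole map (or the whole pyramid), while the attention map stays at N by M entries.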

Multi-View Attentive Contextualization for Multi-View 3D Object Detection

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

Object as Query: Lifting any 2D Object Detector to 3D Detection

Object as Query: Equipping Any 2D Object Detector with 3D Detection Ability

CAMVR: Context-Adaptive Multi-View Representation Learning for Dense Retrieval

MVM3Det: A Novel Method for Multi-view Monocular 3D Detection

Learning Auxiliary Monocular Contexts Helps Monocular 3D Object Detection

Multiview Detection with Feature Perspective Transformation

MV-C3D: A Spatial Correlated Multi-View 3D Convolutional Neural Networks

Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting

M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D Object Detection

CI3D: Context Interaction for Dynamic Objects and Static Map Elements in 3D Driving Scenes

Cascaded Multi-3D-view Fusion for 3D-Oriented Object Detection

VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view Attention for Multi-view 3D Object Detection

Multi-View 3D Object Detection Network for Autonomous Driving

DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

MDHA: Multi-Scale Deformable Transformer with Hybrid Anchors for Multi-View 3D Object Detection

End-to-End Multi-View Fusion for 3D Object Detection in LiDAR Point Clouds

Improving 3D Object Detection with Context-Aware and Dimensional Interaction Attention

MV-MOS: Multi-View Feature Fusion for 3D Moving Object Segmentation