3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

Changyong Shu, Jiajun Deng, Fisher Yu, Yifan Liu
2023-07-28
Abstract: Transformer-based methods have swept the benchmarks on 2D and 3D detection on images. Because tokenization before the attention mechanism drops the spatial information, positional encoding becomes critical for those methods. Recent works found that encodings based on samples of the 3D viewing rays can significantly improve the quality of multi-camera 3D object detection. We hypothesize that 3D point locations can provide more information than rays. Therefore, we introduce 3D point positional encoding, 3DPPE, to the 3D detection Transformer decoder. Although 3D measurements are not available at the inference time of monocular 3D object detection, 3DPPE uses predicted depth to approximate the real point positions. Our hybrid-depth module combines direct and categorical depth to estimate the refined depth of each pixel. Despite the approximation, 3DPPE achieves 46.0 mAP and 51.4 NDS on the competitive nuScenes dataset, significantly outperforming encodings based on ray samples. We make the code available at https://github.com/drilistbox/3DPPE.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper primarily focuses on addressing the positional encoding problem in multi-camera 3D object detection, particularly for Transformer-based methods. Specifically, the paper tackles the following key issues:

1. **Improved Positional Encoding Mechanism**: In existing methods, ray-based positional encoding (3D camera-ray PE) only provides coarse positional information, and the efficiency of the attention mechanism used in the Transformer decoder is limited by the inconsistency between reference points and the image feature representation space.
2. **Accurate Localization**: To improve the accuracy of 3D object detection, the paper proposes a new 3D Point Positional Encoding (3DPPE), which utilizes predicted depth to more accurately locate the 3D point positions corresponding to pixels on the image plane.
3. **Unified Representation Space**: The paper also proposes a shared positional encoder to handle the transformed 3D points and reference points, thereby constructing a unified embedding space to enhance the effectiveness of the attention mechanism.
4. **Depth Estimation Module**: Considering that monocular 3D object detection cannot directly obtain 3D measurements during inference, the paper introduces a lightweight depth estimation module to approximate the real point positions. This includes a hybrid depth module that combines direct regression depth and classification depth to improve the quality of depth estimation.

Through the aforementioned methods, 3DPPE aims to overcome the limitations of existing methods and improve the performance of 3D object detection based on multi-camera systems. Experiments demonstrate that 3DPPE significantly improves detection accuracy on the nuScenes dataset compared to ray-based positional encoding.
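
The steps summarized above (a hybrid depth estimate, lifting pixels to 3D points, and one positional encoder shared with the reference points) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed blend weight `alpha`, the depth-bin layout, the toy intrinsics, and the plain sinusoidal encoder (any learned layers are omitted) are all hypothetical choices made for this example.

```python
import numpy as np

def hybrid_depth(direct, logits, bins, alpha=0.5):
    """Blend a regressed depth map with the expectation of a categorical one.

    direct: (H, W) regressed depth; logits: (D, H, W) per-bin scores;
    bins: (D,) bin centers. `alpha` is an assumed fixed blend weight.
    """
    logits = logits - logits.max(axis=0, keepdims=True)  # stable softmax
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    categorical = np.tensordot(bins, prob, axes=(0, 0))  # expected depth, (H, W)
    return alpha * direct + (1.0 - alpha) * categorical

def pixels_to_points(depth, K, cam_to_ego):
    """Back-project every pixel to a 3D point using its predicted depth."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)     # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                      # camera rays with z = 1
    cam = rays * depth[..., None]                        # points in the camera frame
    homo = np.concatenate([cam, np.ones((H, W, 1))], axis=-1)
    return (homo @ cam_to_ego.T)[..., :3]                # points in the shared frame

def sine_encode(points, num_freqs=4):
    """Sinusoidal encoding of xyz, applied to pixel points and query points alike."""
    freqs = 2.0 ** np.arange(num_freqs)
    scaled = points[..., None] * freqs                   # (..., 3, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*points.shape[:-1], -1)           # (..., 3 * 2 * F)

# Toy inputs standing in for network outputs and camera calibration.
H, W, D = 4, 6, 8
rng = np.random.default_rng(0)
bins = np.linspace(1.0, 60.0, D)
depth = hybrid_depth(rng.uniform(1.0, 60.0, (H, W)),
                     rng.normal(size=(D, H, W)), bins)
K = np.array([[100.0, 0.0, 3.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
points = pixels_to_points(depth, K, np.eye(4))

pixel_pe = sine_encode(points)                            # encoding for image features
query_pe = sine_encode(rng.uniform(-50.0, 50.0, (10, 3)))  # encoding for reference points
```

Because `pixel_pe` and `query_pe` come from the same function, image features and reference points end up in one embedding space, which is the property the summary attributes to the shared encoder. A ray-based encoding would instead sample several fixed depths along each ray rather than committing to one predicted depth per pixel.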