3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

Changyong Shu, Jiajun Deng, Fisher Yu, Yifan Liu

2023-07-28

Abstract: Transformer-based methods have swept the benchmarks on 2D and 3D detection on images. Because tokenization before the attention mechanism drops the spatial information, positional encoding becomes critical for those methods. Recent works found that encodings based on samples of the 3D viewing rays can significantly improve the quality of multi-camera 3D object detection. We hypothesize that 3D point locations can provide more information than rays. Therefore, we introduce 3D point positional encoding, 3DPPE, to the 3D detection Transformer decoder. Although 3D measurements are not available at the inference time of monocular 3D object detection, 3DPPE uses predicted depth to approximate the real point positions. Our hybrid-depth module combines direct and categorical depth to estimate the refined depth of each pixel. Despite the approximation, 3DPPE achieves 46.0 mAP and 51.4 NDS on the competitive nuScenes dataset, significantly outperforming encodings based on ray samples. We make the code available at https://github.com/drilistbox/3DPPE.
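The abstract's idea of combining direct and categorical depth can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `hybrid_depth`, the fixed blending weight `alpha`, and the uniform bin centers are all assumptions made for illustration; the abstract only states that the two depth estimates are combined.

```python
import numpy as np

def hybrid_depth(direct_depth, bin_logits, bin_centers, alpha=0.5):
    """Blend a directly regressed depth with a categorical (per-bin) depth.

    direct_depth: (H, W) regressed depth per pixel.
    bin_logits:   (H, W, D) unnormalized scores over D depth bins.
    bin_centers:  (D,) depth value represented by each bin.
    alpha:        blending weight (hypothetical; fixed here for simplicity).
    """
    # Numerically stable softmax over the depth-bin axis.
    shifted = bin_logits - bin_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Categorical depth is the expected depth under the bin distribution.
    categorical_depth = (probs * bin_centers).sum(axis=-1)

    # Assumed fusion rule: a convex combination of the two estimates.
    return alpha * direct_depth + (1.0 - alpha) * categorical_depth


# Usage: one pixel, three depth bins centered at 5 m, 10 m, and 15 m.
direct = np.array([[5.0]])
logits = np.array([[[0.0, 0.0, 100.0]]])  # almost all mass on the 15 m bin
centers = np.array([5.0, 10.0, 15.0])
refined = hybrid_depth(direct, logits, centers, alpha=0.5)
```

The regressed value gives a continuous estimate, while the bin distribution can represent uncertainty across several candidate depths; blending the two is one plausible way to obtain a refined per-pixel depth.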

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper primarily focuses on addressing the positional encoding problem in multi-camera 3D object detection, particularly for Transformer-based methods. Specifically, the paper tackles the following key issues:

1. **Improved Positional Encoding Mechanism**: In existing methods, ray-based positional encoding (3D camera-ray PE) only provides coarse positional information, and the efficiency of the attention mechanism used in the Transformer decoder is limited by the inconsistency between reference points and the image feature representation space.
2. **Accurate Localization**: To improve the accuracy of 3D object detection, the paper proposes a new 3D Point Positional Encoding (3DPPE), which utilizes predicted depth to more accurately locate the 3D point positions corresponding to pixels on the image plane.
3. **Unified Representation Space**: The paper also proposes a shared positional encoder to handle the transformed 3D points and reference points, thereby constructing a unified embedding space to enhance the effectiveness of the attention mechanism.
4. **Depth Estimation Module**: Considering that monocular 3D object detection cannot directly obtain 3D measurements during inference, the paper introduces a lightweight depth estimation module to approximate the real point positions. This includes a hybrid depth module that combines direct regression depth and classification depth to improve the quality of depth estimation.

Through the aforementioned methods, 3DPPE aims to overcome the limitations of existing methods and improve the performance of 3D object detection based on multi-camera systems. Experiments demonstrate that 3DPPE significantly improves detection accuracy on the nuScenes dataset compared to ray-based positional encoding.
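The point-based encoding and the shared encoder described above can be sketched roughly as follows. This is an illustrative NumPy example under stated assumptions, not the paper's code. It uses a pinhole camera with intrinsics `K`, keeps the points in the camera frame (the transform to a common vehicle frame via extrinsics is omitted), and uses a plain sinusoidal encoding with hypothetical names `pixels_to_points` and `sine_encode`.

```python
import numpy as np

def pixels_to_points(depth, K):
    """Back-project every pixel to a 3D point in the camera frame.

    depth: (H, W) predicted depth per pixel.
    K:     (3, 3) pinhole intrinsics matrix.
    Returns an (H, W, 3) array of points X = depth * K^-1 [u, v, 1]^T.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T  # applies K^-1 to each pixel vector
    return rays * depth[..., None]

def sine_encode(points, num_freqs=4):
    """Sinusoidal encoding of 3D coordinates (assumed encoder form).

    points: (..., 3) array. Returns (..., 3 * 2 * num_freqs) features.
    """
    freqs = 2.0 ** np.arange(num_freqs)
    angles = points[..., None] * freqs  # (..., 3, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*points.shape[:-1], -1)


# Usage: a 3x3 image at constant depth 4, focal length 2, principal point (1, 1).
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
points = pixels_to_points(np.full((3, 3), 4.0), K)

# The same encoder is applied to image points and to decoder reference points,
# so both end up in one embedding space.
image_pe = sine_encode(points)
reference_points = np.zeros((5, 3))  # placeholder query reference points
query_pe = sine_encode(reference_points)
```

A ray-based encoding would use several fixed depth samples along each `rays` direction; the point-based variant instead commits to a single location per pixel given by the predicted depth, which is why depth quality matters for the resulting encoding.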

SEFormer: Structure Embedding Transformer for 3D Object Detection

DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

Introducing Depth into Transformer-based 3D Object Detection

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer

Position-Guided Point Cloud Panoptic Segmentation Transformer

DTSSD: Dual-Channel Transformer-Based Network for Point-Based 3D Object Detection

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

PointCSE: Context-sensitive encoders for efficient 3D object detection from point cloud

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

Depth-Vision-Decoupled Transformer With Cascaded Group Convolutional Attention for Monocular 3-D Object Detection

MonoPSTR: Monocular 3-D Object Detection with Dynamic Position and Scale-Aware Transformer

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

Point-DETR3D: Leveraging Imagery Data with Spatial Point Prior for Weakly Semi-supervised 3D Object Detection

Transformer-based stereo-aware 3D object detection from binocular images

PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection