Abstract: Query-based transformer has shown great potential in constructing long-range attention in many image-domain tasks, but has rarely been considered in LiDAR-based 3D object detection due to the overwhelming size of the point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the feature of the center candidate as the query embedding in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding box on the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer
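
The candidate-selection step described in the abstract can be illustrated with a minimal sketch. It assumes the heatmap is a plain H x W grid of scores and the encoder output is an H x W x C grid of features; the function name `select_center_queries`, the toy values, and the choice of a simple top-k rule are hypothetical and are not taken from the CenterFormer code base.

```python
import heapq
from typing import List, Tuple


def select_center_queries(
    heatmap: List[List[float]],
    features: List[List[List[float]]],
    k: int,
) -> Tuple[List[Tuple[int, int]], List[List[float]]]:
    """Pick the k highest-scoring heatmap cells and gather their features.

    heatmap:  H x W grid of center scores (assumed layout).
    features: H x W x C grid of encoder features (assumed layout).
    Returns the (row, col) positions and the feature vectors used as queries.
    """
    # Flatten the grid into (score, position) pairs.
    scored = [
        (score, (row, col))
        for row, line in enumerate(heatmap)
        for col, score in enumerate(line)
    ]
    # Keep the k best-scoring positions, highest score first.
    top = heapq.nlargest(k, scored, key=lambda item: item[0])
    centers = [position for _, position in top]
    # The feature at each selected center becomes one query embedding.
    queries = [features[row][col] for row, col in centers]
    return centers, queries


# Toy example: a 2 x 3 heatmap with 2-channel features (illustrative values).
heatmap = [
    [0.1, 0.9, 0.2],
    [0.3, 0.05, 0.8],
]
features = [[[float(10 * r + c), 1.0] for c in range(3)] for r in range(2)]
centers, queries = select_center_queries(heatmap, features, k=2)
```

Because the queries are gathered from the encoder features rather than learned from scratch, each one starts out tied to a concrete location in the scene.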

What problem does this paper attempt to address?

The paper addresses the difficulties that existing LiDAR-based 3D object detection methods face on large-scale point cloud data, in particular the lack of global information learning in two-stage networks and the high computational complexity of Transformer-based detectors. Specifically:

1. **Limitations of existing methods**:
   - **Limitations of two-stage networks**: Current two-stage networks (such as RCNN-style networks) rely mainly on local features for bounding box prediction. They ignore the features of other boxes and adjacent positions, which could help improve detection results.
   - **High computational complexity**: Traditional Transformer-based methods (such as DETR) are computationally expensive on large-scale point cloud data, are difficult to converge, and their performance is limited by the dimension of the input features.
2. **Proposed solutions to the problems**:
   - **CenterFormer model**: The paper proposes CenterFormer, a center-based Transformer network for 3D object detection. The model selects center candidates from a center heatmap and uses the features of these centers as the Transformer query embeddings. In addition, a cross-attention mechanism is designed to fuse multi-frame features.
   - **Reducing computational complexity**: Multi-scale cross-attention layers and deformable cross-attention layers reduce the convergence difficulty and computational complexity of the Transformer structure.
   - **Improving detection performance**: The experimental results show significant improvements on the Waymo Open Dataset. As a single model, CenterFormer reaches 73.7% mAPH on the validation set and 75.6% mAPH on the test set, outperforming all previously published CNN and Transformer-based methods.
In summary, the paper introduces a center-based Transformer architecture to address the high computational complexity and insufficient global information learning of existing 3D object detection methods on large-scale point cloud data, thereby improving detection performance.
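
To make the complexity argument concrete, the sketch below shows plain single-head cross-attention from a small set of center queries to a larger set of feature vectors. It is an illustration under stated assumptions, not the paper's method: it uses dense attention without learned projections instead of the multi-scale or deformable layers named above, and the helper names (`cross_attention`, `attention_pair_count`) and the grid size are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Scaled dot-product cross-attention (single head, no learned projections)."""
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for query in queries:
        # One score per key: scaled dot product between the query and the key.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of the values.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(out_dim)]
        )
    return outputs


def attention_pair_count(num_queries: int, num_keys: int) -> int:
    """Number of query-key scores a dense attention layer has to compute."""
    return num_queries * num_keys


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 4.0]]
# A zero query attends uniformly; a query aligned with the first key picks it.
outputs = cross_attention([[0.0, 0.0], [100.0, 0.0]], keys, values)

# Hypothetical sizes: a 100 x 100 feature grid and 500 center queries.
cells = 100 * 100
center_cost = attention_pair_count(500, cells)   # queries limited to centers
dense_cost = attention_pair_count(cells, cells)  # every cell attends to every cell
```

With these illustrative sizes, restricting the queries to 500 centers computes 20 times fewer query-key scores than dense self-attention over the whole grid.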

CenterFormer: Center-based Transformer for 3D Object Detection

SEFormer: Structure Embedding Transformer for 3D Object Detection

Anchor-Based Transformer for Temporal LiDAR 3D Object Detection

OcTr: Octree-based Transformer for 3D Object Detection

3D point cloud object detection algorithm based on Transformer

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

Object Detection of Occlusion Point Cloud based on Transformer.

STFormer3D: Spatio-Temporal Transformer Based 3D Object Detection for Intelligent Driving.

Multi-Scale Spatial Transformer Network for LiDAR-Camera 3D Object Detection.

Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

HCT-Det: a Hybrid CNN-transformer Architecture for 3D Object Detection from Point Clouds

DVST: Deformable Voxel Set Transformer for 3D Object Detection from Point Clouds

Collect-and-Distribute Transformer for 3D Point Cloud Analysis

TransCenter: Transformers With Dense Representations for Multiple-Object Tracking

Improving 3D Object Detection with Channel-wise Transformer

Multi-Source Features Fusion Single Stage 3D Object Detection with Transformer.

LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception

Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving

MVTr: Multi-Feature Voxel Transformer for 3D Object Detection

Transformer-Based Optimized Multimodal Fusion for 3D Object Detection in Autonomous Driving

CenterNet3D: An Anchor Free Object Detector for Point Cloud