CenterFormer: Center-based Transformer for 3D Object Detection

Zixiang Zhou, Xiangchen Zhao, Yu Wang, Panqu Wang, Hassan Foroosh
DOI: https://doi.org/10.48550/arXiv.2209.05588
2022-09-13
Abstract: Query-based transformer has shown great potential in constructing long-range attention in many image-domain tasks, but has rarely been considered in LiDAR-based 3D object detection due to the overwhelming size of the point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the feature of the center candidate as the query embedding in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding box on the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges that existing LiDAR-based 3D object detection methods face on large-scale point cloud data, especially the lack of global information learning in two-stage networks and the high computational complexity of Transformer-based methods. Specifically:

1. **Limitations of existing methods**:
   - **Limitations of two-stage networks**: Current two-stage networks (such as RCNN-style networks) rely mainly on local features for bounding box prediction. They ignore the features of other boxes and adjacent positions, which could help improve detection results.
   - **High computational complexity**: Traditional Transformer-based methods (such as DETR) are computationally expensive on large-scale point cloud data, are difficult to converge, and are limited in performance by the dimension of the input features.
2. **Proposed solutions**:
   - **CenterFormer model**: The paper proposes CenterFormer, a center-based Transformer network for 3D object detection. The model uses a center heatmap to select center candidates and uses the features of those candidates as the Transformer query embeddings. A cross-attention mechanism is also designed to fuse features from multiple frames.
   - **Reducing computational complexity**: Multi-scale cross-attention layers and deformable cross-attention layers reduce the convergence difficulty and computational complexity of the Transformer structure.
   - **Improving detection performance**: The experiments show significant improvements on the Waymo Open Dataset. As a single model, CenterFormer reaches 73.7% mAPH on the validation set and 75.6% mAPH on the test set, outperforming all previously published CNN and Transformer-based methods.
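The center-query idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `select_center_queries` and `cross_attention` are hypothetical, the attention is plain dense single-head attention without learned projections (the paper uses multi-scale and deformable cross-attention), and the heatmap and feature map are random placeholders rather than encoder outputs.

```python
import numpy as np


def select_center_queries(heatmap, bev_features, k):
    """Return the features of the k highest-scoring heatmap cells as queries.

    heatmap:      (H, W) array of center scores (hypothetical input).
    bev_features: (C, H, W) bird's-eye-view feature map (hypothetical input).
    """
    h, w = heatmap.shape
    flat = heatmap.reshape(-1)
    # Indices of the k largest scores, highest first.
    top = np.argsort(-flat, kind="stable")[:k]
    ys, xs = np.unravel_index(top, (h, w))
    # Gather one C-dimensional feature vector per selected center: (k, C).
    queries = bev_features[:, ys, xs].T
    centers = np.stack([ys, xs], axis=1)
    return queries, centers


def cross_attention(queries, keys, values):
    """Dense scaled dot-product cross-attention (simplified, no projections)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # (k, N)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W, K = 8, 16, 16, 4
    bev = rng.normal(size=(C, H, W))
    heat = rng.random((H, W))
    heat[3, 5] = 2.0  # plant one clear peak so the top candidate is known

    queries, centers = select_center_queries(heat, bev, K)
    keys = bev.reshape(C, -1).T  # every BEV cell acts as a key/value: (H*W, C)
    refined, attn = cross_attention(queries, keys, keys)
    print(centers[0], queries.shape, refined.shape)
```

Because only `K` center candidates act as queries, the attention matrix has shape `(K, H*W)` instead of `(H*W, H*W)`, which gives a rough sense of why center-based queries are cheaper than attending from every cell.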
In summary, this paper aims to address the high computational complexity and insufficient global information learning of existing 3D object detection methods on large-scale point cloud data. It does so by introducing a center-based Transformer architecture, which improves detection performance.