HCT-Det: a Hybrid CNN-transformer Architecture for 3D Object Detection from Point Clouds

Chao Wang
DOI: https://doi.org/10.1117/12.3005832
2023-01-01
Abstract: Detecting 3D objects from LiDAR points is significant for the environmental perception of robotic systems. Some pillar-based 3D object detectors use only 2D convolutions as feature encoders, which consumes fewer computational resources but sacrifices model accuracy. To unlock the potential of pillar-based feature representations, we propose HCT-Det, a novel hybrid CNN-Transformer architecture for 3D object detection from point clouds. Motivated by the structural re-parameterization technique and the vision transformer (ViT) framework, we redesign the 2D backbone and introduce the Rep-VGG block and a multi-head self-attention (MHSA) mechanism to enrich the scale diversity of the feature representation. We perform ablation experiments on the KITTI vision benchmarks to highlight the superiority of HCT-Det. The evaluation results show that our model outperforms the PointPillars baseline, achieving 79.08 moderate AP3D on the car category at 57.46 FPS on an NVIDIA Tesla P40. Without bells and whistles, HCT-Det achieves a reasonable trade-off between accuracy and speed.
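The abstract names structural re-parameterization without showing the mechanism, so the following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a block with three parallel branches (a 3x3 convolution, a 1x1 convolution, and an identity shortcut) and equal input and output channel counts. Batch-norm folding and biases are omitted for brevity. The helper names `conv2d_same` and `fuse_branches` are hypothetical.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive stride-1, zero-padded 'same' convolution (cross-correlation).

    x: (C_in, H, W) feature map; w: (C_out, C_in, k, k) kernel with odd k.
    """
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Mix channels with the (C_out, C_in) slice at kernel offset (i, j).
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def fuse_branches(w3, w1):
    """Merge 3x3, 1x1 and identity branches into one 3x3 kernel.

    Assumes C_in == C_out so the identity branch is well defined.
    """
    c_out = w3.shape[0]
    fused = w3.copy()
    fused[:, :, 1, 1] += w1[:, :, 0, 0]   # 1x1 kernel sits at the 3x3 center
    fused[:, :, 1, 1] += np.eye(c_out)    # identity is a centered delta kernel
    return fused

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w3 = rng.standard_normal((4, 4, 3, 3))
w1 = rng.standard_normal((4, 4, 1, 1))

# Training-time form: three parallel branches summed.
multi_branch = conv2d_same(x, w3) + conv2d_same(x, w1) + x
# Inference-time form: one fused 3x3 convolution.
single_branch = conv2d_same(x, fuse_branches(w3, w1))
max_diff = float(np.abs(multi_branch - single_branch).max())
```

Because convolution is linear, the two forms agree up to floating-point error. This is why such a block can train with several branches yet run as a single convolution at inference time.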