D-CONFORMER: Deformable Sparse Transformer Augmented Convolution for Voxel-Based 3D Object Detection

Xiao Zhao, Liuzhen Su, Xukun Zhang, Dingkang Yang, Mingyang Sun, Shunli Wang, Peng Zhai, Lihua Zhang
DOI: https://doi.org/10.1109/icassp49357.2023.10097060
2023-01-01
ICASSP
Abstract: Although CNN-based and Transformer-based detectors have made impressive improvements in 3D object detection, these two network paradigms suffer from insufficient receptive fields and weakened local detail, which significantly limits the feature extraction performance of the backbone. In this paper, we propose to fuse convolution and transformer. Because non-empty voxels at different positions in 3D space contribute differently to object detection, applying standard convolution and transformer directly to voxels is not appropriate. Specifically, we design a novel deformable sparse transformer that performs long-range information interaction on the fine-grained local detail semantics aggregated by focal sparse convolution; we term the resulting design D-Conformer. D-Conformer learns which voxels are valuable in a position-wise manner in sparse space and can be applied to most voxel-based detectors as a backbone. Extensive experiments demonstrate that our method achieves satisfactory detection results and outperforms state-of-the-art 3D detection methods by a large margin.