BEVSpread: Spread Voxel Pooling for Bird’s-Eye-View Representation in Vision-based Roadside 3D Object Detection

Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, Jingdong Wang, Gaoang Wang, Xi Li
DOI: https://doi.org/10.1109/cvpr52733.2024.01394
2024-01-01
Abstract: Vision-based roadside 3D object detection has attracted rising attention in the autonomous driving domain, since it has inherent advantages in reducing blind spots and expanding perception range. Previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically, instead of bringing the image features contained in a frustum point to a single BEV grid, BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance, a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves inference time comparable to that of the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread can significantly improve the performance of existing frustum-based BEV methods, by (1.12, 5.26, 3.01) AP on vehicle, pedestrian and cyclist, respectively.
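The spreading idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `spread_voxel_pooling`, the Gaussian-shaped weight, the parameters `radius`, `alpha`, and `ref_depth`, and the choice to widen the spread with depth are all assumptions made here for illustration. The abstract only states that the weights decay according to distance and depth.

```python
import numpy as np

def spread_voxel_pooling(points_xy, depths, feats, grid_shape,
                         cell_size=1.0, radius=1, alpha=0.5, ref_depth=50.0):
    """Hypothetical sketch: spread each frustum point's feature over nearby BEV cells.

    points_xy : (N, 2) BEV positions (x, y) in metres
    depths    : (N,)   depth of each frustum point in metres
    feats     : (N, C) image features attached to each point
    """
    H, W = grid_shape
    bev = np.zeros((H, W, feats.shape[1]), dtype=np.float64)
    # Index of the cell each point falls into (what single-cell pooling would use).
    base = np.floor(points_xy / cell_size).astype(int)

    for n in range(len(points_xy)):
        cells, weights = [], []
        # ASSUMPTION: the weight width grows with depth; the paper's function may differ.
        sigma = alpha * cell_size * (1.0 + depths[n] / ref_depth)
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                ix, iy = base[n, 0] + dx, base[n, 1] + dy
                if 0 <= ix < W and 0 <= iy < H:
                    centre = (np.array([ix, iy]) + 0.5) * cell_size
                    d2 = np.sum((points_xy[n] - centre) ** 2)
                    # ASSUMPTION: Gaussian-shaped decay with squared distance.
                    weights.append(np.exp(-d2 / (2.0 * sigma ** 2)))
                    cells.append((iy, ix))
        if not cells:
            continue  # the point lies outside the BEV grid
        w = np.array(weights)
        w /= w.sum()  # normalise so each point's feature mass is preserved
        for (iy, ix), wi in zip(cells, w):
            bev[iy, ix] += wi * feats[n]
    return bev

# One point at the centre of cell (2, 2) on a 5x5 grid, with a 2-channel feature.
bev = spread_voxel_pooling(np.array([[2.5, 2.5]]), np.array([10.0]),
                           np.array([[1.0, 2.0]]), grid_shape=(5, 5))
```

In this sketch, setting `radius=0` reduces to the conventional scheme in which each point contributes to exactly one cell. The Python loops are for readability only; the abstract reports that the actual method relies on a customized CUDA kernel to keep inference time comparable.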