EPAWFusion: multimodal fusion for 3D object detection based on enhanced points and adaptive weights
Xiang Sun,Shaojing Song,Fan Wu,Tingting Lu,Bohao Li,Zhiqing Miao
DOI: https://doi.org/10.1117/1.jrs.18.017501
IF: 1.568
2024-01-14
Journal of Applied Remote Sensing
Abstract: Fusing LiDAR point clouds and camera images for 3D object detection in autonomous driving has emerged as a captivating research avenue. The core challenge of multimodal fusion is how to seamlessly fuse 3D LiDAR point clouds with 2D camera images. Although current approaches exhibit promising results, they often rely on fusion at only one of the data, feature, or object levels, and there is still room for improvement in the utilization of multimodal information. We present an advanced and effective multimodal fusion framework called EPAWFusion for fusing 3D point clouds and 2D camera images at both the data level and the feature level. The EPAWFusion model consists of three key modules: a point enhanced module based on semantic segmentation for data-level fusion, an adaptive weight allocation module for feature-level fusion, and a detector based on 3D sparse convolution. The semantic information of the 2D image is extracted using semantic segmentation, and the calibration matrix is used to establish the point-pixel correspondence. The semantic information and distance information are then attached to the point cloud to achieve data-level fusion. The geometric features of the enhanced point cloud are extracted by voxel encoding, and the texture features of the image are obtained using a pretrained 2D CNN. Feature-level fusion is achieved via the adaptive weight allocation module. The fused features are fed into a 3D sparse convolution-based detector to produce accurate 3D detections. Experimental results demonstrate that EPAWFusion outperforms the baseline network MVXNet on the KITTI dataset for 3D detection of cars, pedestrians, and cyclists by 5.81%, 6.97%, and 3.88%, respectively. Additionally, experiments on the DAIR-V2X dataset show that EPAWFusion performs well for single-vehicle-side 3D object detection, and the inference frame rate of the proposed model reaches 11.1 FPS. The two-level fusion of EPAWFusion significantly enhances the performance of multimodal 3D object detection.
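The data-level fusion step described in the abstract (project each LiDAR point into the image with the calibration matrix, then attach semantic and distance information) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `enhance_points`, the single 3x4 projection matrix `proj`, the per-pixel score layout of `seg_scores`, and the use of Euclidean range as the "distance information" are all assumptions made for illustration.

```python
import math


def enhance_points(points, proj, seg_scores):
    """Attach per-pixel semantic scores and range to LiDAR points.

    points:     list of (x, y, z) tuples in the LiDAR frame
    proj:       hypothetical 3x4 matrix (list of rows) mapping homogeneous
                LiDAR coordinates to image coordinates; assumed to already
                combine extrinsic and intrinsic calibration
    seg_scores: H x W x C nested list of class scores, assumed to come
                from a 2D semantic segmentation network
    """
    height, width = len(seg_scores), len(seg_scores[0])
    num_classes = len(seg_scores[0][0])
    enhanced = []
    for x, y, z in points:
        hom = (x, y, z, 1.0)
        # Project to the image plane: (u * depth, v * depth, depth).
        u_h, v_h, depth = (sum(row[i] * hom[i] for i in range(4)) for row in proj)
        # Points behind the camera or outside the image keep zero scores.
        semantic = [0.0] * num_classes
        if depth > 0:
            u, v = math.floor(u_h / depth), math.floor(v_h / depth)
            if 0 <= u < width and 0 <= v < height:
                semantic = list(seg_scores[v][u])
        # Assumption: "distance information" is the Euclidean range to the sensor.
        distance = math.sqrt(x * x + y * y + z * z)
        enhanced.append([x, y, z, distance, *semantic])
    return enhanced
```

Each output row has the form `[x, y, z, distance, score_0, ..., score_{C-1}]`, which is the kind of enhanced point that could then be passed to a voxel encoder.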
environmental sciences,imaging science & photographic technology,remote sensing