PMAFusion: Projection-Based Multi-Modal Alignment for 3D Semantic Occupancy Prediction

Shiyao Li, Wenming Yang, Qingmin Liao
DOI: https://doi.org/10.1109/cvprw63382.2024.00366
2024-01-01
Computer Vision and Pattern Recognition
Abstract: 3D Semantic Occupancy Prediction offers holistic scene understanding that combines spatial structure with semantic analysis. Current research in this field primarily focuses on single-modal inputs, relying on either images or point cloud data. The potential of combining the complementary attributes of images and point clouds has not been fully explored. A previous method uses monocular depth estimation to lift image features into 3D space for direct concatenation, which may introduce noise due to inaccurate depth prediction. It can also lead to substantial memory usage, because dense image feature volumes are constructed explicitly. To this end, we propose PMAFusion, an effective fusion module based on accurate multi-modal alignment. We first project the point cloud onto images using camera parameters, thereby aligning each voxel with its associated pixels. A cross-attention module is then used to adaptively fuse voxel-pixel features for improved representation. To handle empty voxels, which have no naturally aligned pixels, we generate reference points through uniform sampling to supplement the missing spatial information. With PMAFusion, we achieve the best results on the nuScenes-Occupancy dataset and conduct thorough experiments to evaluate the effectiveness and efficiency of our proposed method.
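The two steps named in the abstract, projecting points onto an image with camera parameters and fusing each voxel with its aligned pixels through cross-attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `project_points` and `cross_attention`, the pinhole camera model, the single-head attention without learned projections, and the residual addition are all assumptions made for illustration.

```python
import numpy as np

def project_points(points_lidar, lidar_to_cam, K, image_hw):
    """Project (N, 3) LiDAR points to pixel coordinates (hypothetical helper).

    lidar_to_cam is a 4x4 extrinsic matrix and K a 3x3 pinhole intrinsic matrix.
    Returns (uv, valid): uv is (N, 2) pixel coordinates, valid is an (N,) mask
    for points in front of the camera that land inside the image.
    """
    n = points_lidar.shape[0]
    homog = np.hstack([points_lidar, np.ones((n, 1))])   # (N, 4) homogeneous
    cam = (lidar_to_cam @ homog.T).T[:, :3]              # (N, 3) camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)          # avoid division by zero
    pix = (K @ cam.T).T                                  # (N, 3) before normalization
    uv = pix[:, :2] / safe_depth[:, None]
    h, w = image_hw
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, in_front & inside

def cross_attention(voxel_feat, pixel_feats):
    """Single-head scaled dot-product attention (assumed form, no learned weights).

    voxel_feat: (D,) query; pixel_feats: (M, D) keys and values.
    Returns the voxel feature plus the attention-weighted pixel feature.
    """
    d = voxel_feat.shape[0]
    scores = pixel_feats @ voxel_feat / np.sqrt(d)       # (M,)
    weights = np.exp(scores - scores.max())              # numerically stable softmax
    weights /= weights.sum()
    return voxel_feat + weights @ pixel_feats            # residual fusion (assumption)

# Toy usage: identity extrinsics and a 100x100 image.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 10.0],    # on the optical axis
                   [1.0, 0.0, 10.0],    # offset to the right
                   [0.0, 0.0, -5.0]])   # behind the camera
uv, valid = project_points(points, np.eye(4), K, (100, 100))
fused = cross_attention(np.zeros(4), np.array([[1.0] * 4, [3.0] * 4]))
```

In this sketch, a voxel whose points all fail the `valid` mask would have no pixels to attend to. The abstract addresses that case by uniformly sampling reference points for empty voxels, which is not modeled here.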