Transformer-Based Cross-Modal Information Fusion Network for Semantic Segmentation

Huang, Xiao
DOI: https://doi.org/10.1007/s11063-022-11142-8
IF: 2.565
2023-02-21
Neural Processing Letters
Abstract: 3D LiDAR semantic segmentation has significant applications in environmental perception, such as autonomous driving and intelligent robotics. For autonomous cars equipped with cameras and LiDAR, it is essential to fuse complementary modal information, such as camera color and point cloud depth, for environment perception. However, existing fusion algorithms may not achieve encouraging performance owing to the enormous difference between the two modalities. In this work, we propose a Transformer-based cross-modal information fusion network (TCIFNet) to explore model discrepancies. To this end, we first project the point clouds onto the camera coordinates to provide spatial depth information. We then use a Transformer to extract features and fuse them with effective residual-based self-attentive modules. For the camera stream, an additional window-based masked-image module improves mIoU by 1.8%. Moreover, we propose a multimodal distillation loss to measure the difference between the two modalities. Extensive experiments on the benchmark dataset show the superiority of our method. Specifically, compared to the state-of-the-art model, we achieve better completeness as well as better robustness in the category with more points.
computer science, artificial intelligence
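The projection step mentioned in the abstract (mapping LiDAR points into the camera frame to obtain per-pixel depth) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a standard pinhole camera model, and the function name `project_points_to_depth`, the extrinsic matrix `T_cam_lidar`, and the intrinsic matrix `K` are hypothetical names introduced here for illustration.

```python
import numpy as np

def project_points_to_depth(points, T_cam_lidar, K, height, width):
    """Project LiDAR points (N, 3) into a sparse depth image (height, width).

    Assumptions (not from the paper): T_cam_lidar is a 4x4 LiDAR-to-camera
    transform, K is a 3x3 pinhole intrinsic matrix, and pixels that receive
    no point are left at 0.
    """
    # Homogeneous coordinates, then move the points into the camera frame.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    cam = cam[cam[:, 2] > 0]

    # Pinhole projection to pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Discard points that fall outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], cam[inside, 2]

    # When several points hit the same pixel, keep the nearest one.
    depth = np.full((height, width), np.inf, dtype=np.float32)
    np.minimum.at(depth, (v, u), z.astype(np.float32))
    depth[np.isinf(depth)] = 0.0
    return depth
```

The resulting sparse depth map has the same resolution as the camera image, so it could in principle be fed to an image-style backbone alongside the color stream; how TCIFNet actually consumes the projected points is not specified in the abstract.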