Transformer Based Multi-modal Fusion for Place Recognition with Self-attention Mechanism

Yan Pan, Xiang Yu, Zeqing Chang, Fang, Bo Zhou
DOI: https://doi.org/10.1109/cac57257.2022.10055623
2022-01-01
Abstract: Applying the complementary information captured by different sensors, i.e., cameras and LiDAR, to place recognition remains a challenging task in autonomous driving. In this paper, a new pipeline is designed to tackle the retrieval problem with multi-modal late fusion. Lightweight convolutional blocks are applied to encode features from images, which are then converted into pseudo point clouds. Inspired by the progress of the Transformer, the network employs a residual point transformer module to extract feature vectors for the 3D points and the pseudo point clouds respectively. Finally, the two corresponding local descriptors are fused into a robust global descriptor, which can be used for place recognition after end-to-end training. Experimental results on sequences 00, 02, 05, and 06 of the KITTI dataset validate that fusing information from the two modalities improves performance.
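The late-fusion and retrieval steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, so the choices here (max pooling over points, concatenation of the two branches, L2 normalization, and cosine-similarity retrieval) are assumptions made for illustration only, and the function names `global_descriptor`, `fuse`, and `retrieve` are hypothetical. The learned encoders (convolutional blocks and residual point transformer) are not modeled; the inputs stand in for their per-point output features.

```python
import numpy as np


def global_descriptor(local_feats: np.ndarray) -> np.ndarray:
    """Max-pool (N, D) per-point features into a single (D,) vector.

    Assumption: max pooling is used as a stand-in aggregation step.
    """
    return local_feats.max(axis=0)


def fuse(lidar_feats: np.ndarray, pseudo_feats: np.ndarray) -> np.ndarray:
    """Late fusion: aggregate each branch, concatenate, then L2-normalize.

    Assumption: concatenation is an illustrative fusion operator, not
    necessarily the one used in the paper.
    """
    fused = np.concatenate(
        [global_descriptor(lidar_feats), global_descriptor(pseudo_feats)]
    )
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused


def retrieve(query: np.ndarray, database: np.ndarray) -> int:
    """Return the index of the database descriptor most similar to the query.

    Descriptors are unit-norm, so the dot product equals cosine similarity.
    """
    similarities = database @ query
    return int(np.argmax(similarities))
```

In use, one fused descriptor would be computed per mapped place and stacked into a `(M, 2D)` database array; a query descriptor built the same way is then matched with `retrieve(query, database)`.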