MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking

Zhiheng Li, Yubo Cui, Yu Lin, Zheng Fang
DOI: https://doi.org/10.48550/arXiv.2305.06794
2023-08-15
Abstract: 3D single object tracking plays a crucial role in computer vision. Mainstream methods mainly rely on point clouds to achieve geometry matching between the target template and the search area. However, textureless and incomplete point clouds make it difficult for single-modal trackers to distinguish objects with similar structures. To overcome the limitations of geometry matching, we propose a Multi-modal Multi-level Fusion Tracker (MMF-Track), which exploits the image texture and the geometric characteristics of point clouds to track 3D targets. Specifically, we first propose a Space Alignment Module (SAM) to align RGB images with point clouds in 3D space, which is the prerequisite for constructing inter-modal associations. Then, at the feature interaction level, we design a Feature Interaction Module (FIM) based on a dual-stream structure, which enhances intra-modal features in parallel and constructs inter-modal semantic associations. Meanwhile, in order to refine the features of each modality, we introduce a Coarse-to-Fine Interaction Module (CFIM) to realize hierarchical feature interaction at different scales. Finally, at the similarity fusion level, we propose a Similarity Fusion Module (SFM) to aggregate geometry and texture clues from the target. Experiments show that our method achieves state-of-the-art performance on KITTI (39% Success and 42% Precision gains over the previous multi-modal method) and is also competitive on NuScenes.
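As a rough illustration of what aligning RGB images with point clouds involves, the sketch below projects 3D points into an image with a pinhole camera model and attaches the pixel color to each point. This is a generic technique, not the paper's SAM. The function names, the intrinsics `fx`, `fy`, `cx`, `cy`, and the assumption that the points are already expressed in the camera frame (LiDAR-to-camera extrinsics applied beforehand) are all hypothetical choices made for the example.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
RGB = Tuple[int, int, int]


def project_point(p: Point, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Project a camera-frame 3D point to integer pixel coordinates (u, v)."""
    x, y, z = p
    if z <= 0.0:
        # The point lies behind the camera, so it has no valid projection.
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(round(u)), int(round(v))


def colorize_points(points: Sequence[Point],
                    image: Sequence[Sequence[RGB]],
                    fx: float, fy: float,
                    cx: float, cy: float) -> List[Optional[RGB]]:
    """Return the RGB value each point projects to, or None if it is not visible.

    `image` is indexed as image[row][column], i.e. image[v][u], and is
    assumed to be non-empty.
    """
    height, width = len(image), len(image[0])
    colors: List[Optional[RGB]] = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None or not (0 <= uv[0] < width and 0 <= uv[1] < height):
            colors.append(None)  # behind the camera or outside the image
        else:
            u, v = uv
            colors.append(image[v][u])
    return colors


# Toy 4x4 image whose pixel at column u, row v stores (u, v, 0).
toy_image = [[(u, v, 0) for u in range(4)] for v in range(4)]
toy_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (5.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
toy_colors = colorize_points(toy_points, toy_image, 1.0, 1.0, 2.0, 2.0)
```

In the toy example, the first two points land inside the image and pick up a color, while the third falls outside the image and the fourth is behind the camera, so both receive `None`. A real pipeline would read the intrinsics and extrinsics from the dataset's calibration files.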
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of 3D single-object tracking (3D SOT) in computer vision and autonomous driving. Existing mainstream methods rely mainly on point clouds for geometric matching. However, because point clouds lack texture and are often incomplete, unimodal trackers struggle to distinguish objects with similar structures. To address this, the authors propose a multi-modal multi-level fusion tracker (MMF-Track) that combines image texture with point-cloud geometric features to improve the accuracy of 3D object tracking.

### Specific Problems and Solutions

1. **Limitations of Geometric Matching**:
   - Existing 3D SOT methods rely mainly on point clouds for geometric matching, but point-cloud data are usually sparse and lack texture information, which makes it difficult for unimodal trackers to distinguish objects with similar structures.
   - **Solution**: MMF-Track improves the accuracy of target identification by introducing image texture information and combining it with point-cloud geometric features.
2. **Multi-modal Data Alignment**:
   - Multi-modal trackers need to align data from different sensors and establish associations between the search and template areas, which is more complex than in 3D detection tasks.
   - **Solution**: The authors propose a space alignment module (SAM) to align RGB images with point clouds in 3D space, laying the foundation for subsequent multi-modal semantic associations.
3. **Feature Interaction and Fusion**:
   - Feature information at a single scale is limited, which makes it difficult to construct complex semantic associations.
   - **Solution**: The authors design a coarse-to-fine interaction module (CFIM) that uses multi-scale features to gradually refine and enhance the feature representation of each modality.
4. **Similarity Fusion**:
   - Previous methods use only the geometric similarity of point clouds and ignore texture information.
   - **Solution**: The authors propose a similarity fusion module (SFM), which generates both a geometric similarity and a texture similarity and improves tracking performance by adaptively fusing the two.

### Summary

MMF-Track overcomes the limitations of unimodal 3D SOT methods on textureless and incomplete point clouds by combining image texture with point-cloud geometric features. Through its space alignment, feature interaction, and similarity fusion modules, MMF-Track achieves more robust 3D object tracking. Experiments show that the method achieves state-of-the-art performance on KITTI, with gains of 39% in Success and 42% in Precision over the previous multi-modal method, and is competitive on NuScenes.
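The idea of adaptively fusing a geometric similarity and a texture similarity can be illustrated with a small sketch. It is not the paper's SFM: the cosine similarity, the hand-written softmax gate in `fuse_similarities`, the `temperature` parameter, and the toy feature vectors are all assumptions chosen for the example, whereas the actual module presumably learns its weighting from data.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_similarities(geo_sim: float, tex_sim: float,
                      temperature: float = 1.0) -> float:
    """Softmax-weighted average of the two similarities.

    Hypothetical gate: the larger similarity receives the larger weight.
    """
    w_geo = math.exp(geo_sim / temperature)
    w_tex = math.exp(tex_sim / temperature)
    return (w_geo * geo_sim + w_tex * tex_sim) / (w_geo + w_tex)


def best_candidate(template_geo: Sequence[float],
                   template_tex: Sequence[float],
                   cand_geo: Sequence[Sequence[float]],
                   cand_tex: Sequence[Sequence[float]]) -> int:
    """Index of the search-area candidate with the highest fused similarity."""
    scores: List[float] = [
        fuse_similarities(cosine_similarity(template_geo, g),
                          cosine_similarity(template_tex, t))
        for g, t in zip(cand_geo, cand_tex)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Two candidates with identical geometry but different texture features.
template_geo, template_tex = [1.0, 0.0], [1.0, 0.0]
cand_geo = [[1.0, 0.0], [1.0, 0.0]]
cand_tex = [[0.0, 1.0], [1.0, 0.0]]
winner = best_candidate(template_geo, template_tex, cand_geo, cand_tex)
```

In this toy case both candidates match the template's geometry equally well, so geometry alone cannot separate them. The texture similarity breaks the tie, and the second candidate (index 1) is selected.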