Abstract: 3D single object tracking plays a crucial role in computer vision. Mainstream methods rely mainly on point clouds to achieve geometry matching between the target template and the search area. However, textureless and incomplete point clouds make it difficult for single-modal trackers to distinguish objects with similar structures. To overcome the limitations of geometry matching, we propose a Multi-modal Multi-level Fusion Tracker (MMF-Track), which exploits both image texture and the geometric characteristics of point clouds to track 3D targets. Specifically, we first propose a Space Alignment Module (SAM) to align RGB images with point clouds in 3D space, which is the prerequisite for constructing inter-modal associations. Then, at the feature interaction level, we design a Feature Interaction Module (FIM) based on a dual-stream structure, which enhances intra-modal features in parallel and constructs inter-modal semantic associations. Meanwhile, to refine the features of each modality, we introduce a Coarse-to-Fine Interaction Module (CFIM) that realizes hierarchical feature interaction at different scales. Finally, at the similarity fusion level, we propose a Similarity Fusion Module (SFM) to aggregate geometry and texture clues from the target. Experiments show that our method achieves state-of-the-art performance on KITTI (39% Success and 42% Precision gains against the previous multi-modal method) and is also competitive on NuScenes.
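To make the idea of aligning RGB images with point clouds more concrete, here is a minimal sketch of one common way to associate the two modalities: project each 3D point into the image with a camera projection matrix and attach the color of the pixel it lands on. This is an illustrative assumption, not the paper's SAM, whose internals the abstract does not describe. The names `project_point` and `colorize_points`, the 3x4 matrix `P`, and the nested-list image layout are all hypothetical.

```python
import math


def project_point(P, xyz):
    """Project a 3D point (camera coordinates) to pixel (u, v) with a 3x4 matrix P.

    Returns None when the point lies behind the camera.
    """
    hom = (xyz[0], xyz[1], xyz[2], 1.0)  # homogeneous coordinates
    u_h, v_h, depth = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    if depth <= 0:
        return None
    return u_h / depth, v_h / depth


def colorize_points(points, image, P):
    """Attach RGB to every point that projects inside the image.

    points: iterable of (x, y, z) tuples in camera coordinates (assumed).
    image:  H x W nested list of (r, g, b) tuples (assumed layout).
    Returns a list of (x, y, z, r, g, b) tuples; other points are dropped.
    """
    height, width = len(image), len(image[0])
    colored = []
    for xyz in points:
        uv = project_point(P, xyz)
        if uv is None:
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= row < height and 0 <= col < width:
            colored.append((*xyz, *image[row][col]))
    return colored
```

In this sketch, points behind the camera or outside the image are dropped, so only points with a valid pixel correspondence carry texture information forward.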

What problem does this paper attempt to address?

This paper addresses the challenges of 3D single-object tracking (3D SOT) in computer vision and autonomous driving. Existing mainstream methods rely mainly on point clouds for geometric matching. However, because point clouds lack texture and are often incomplete, unimodal trackers struggle to distinguish objects with similar structures. To solve this problem, the authors propose a multi-modal multi-level fusion tracker (MMF-Track) that combines image texture with point-cloud geometric features to improve the accuracy of 3D object tracking.

### Specific Problems and Solutions

1. **Limitations of Geometric Matching**:
   - Existing 3D SOT methods rely mainly on point clouds for geometric matching, but point-cloud data are usually sparse and lack texture information, which makes it difficult for unimodal trackers to distinguish objects with similar structures.
   - **Solution**: MMF-Track improves the accuracy of target identification by combining image texture information with point-cloud geometric features.

2. **Multi-modal Data Alignment**:
   - Multi-modal trackers need to align data from different sensors and establish associations between the search and template areas, which is more complex than in 3D detection tasks.
   - **Solution**: The authors propose a space alignment module (SAM) that aligns RGB images with point clouds in 3D space, laying the foundation for subsequent multi-modal semantic associations.

3. **Feature Interaction and Fusion**:
   - Feature information at a single scale is limited, which makes it difficult to construct complex semantic associations.
   - **Solution**: The authors design a coarse-to-fine interaction module (CFIM) that uses multi-scale features to gradually refine and enhance the feature representations of each modality.

4. **Similarity Fusion**:
   - Previous methods use only the geometric similarity of point clouds and ignore texture information.
   - **Solution**: The authors propose a similarity fusion module (SFM) that generates both geometric similarity and texture similarity, and improves tracking performance by adaptively fusing the two.

### Summary

MMF-Track overcomes the limitations of unimodal 3D SOT methods in handling textureless and incomplete point clouds by combining image texture with point-cloud geometric features. Through its space alignment, feature interaction, and similarity fusion modules, MMF-Track achieves more robust 3D object tracking. Experimental results show that the method achieves state-of-the-art performance on KITTI and competitive performance on NuScenes.
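The adaptive fusion of geometric and texture similarity can be illustrated with a small sketch. The text above does not say how the weighting is computed, so this example assumes a per-point softmax gate over two confidence logits; in a trained model those logits would presumably come from learned layers, whereas here they are plain inputs. The function name `fuse_similarities` and all of its parameters are hypothetical and not taken from the paper.

```python
import math


def fuse_similarities(geo_sim, tex_sim, geo_logit, tex_logit):
    """Blend per-point geometry and texture similarity with softmax weights.

    All four arguments are equal-length sequences of floats (assumed layout).
    A higher logit for one modality pulls the fused score toward that modality.
    """
    fused = []
    for g, t, lg, lt in zip(geo_sim, tex_sim, geo_logit, tex_logit):
        m = max(lg, lt)  # subtract the max for numerical stability
        w_geo = math.exp(lg - m)
        w_tex = math.exp(lt - m)
        fused.append((w_geo * g + w_tex * t) / (w_geo + w_tex))
    return fused
```

With equal logits, the fused score is the plain average of the two similarities. When one logit dominates, the fused score approaches that modality's similarity. This is the behavior one would want when, for example, the point cloud is too sparse for the geometric cue to be trusted.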

MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking
