Delving into Motion-Aware Matching for Monocular 3D Object Tracking

Kuan-Chih Huang, Ming-Hsuan Yang, Yi-Hsuan Tsai
2023-08-23
Abstract: Recent advances in monocular 3D object detection facilitate the 3D multi-object tracking (MOT) task based on low-cost camera sensors. In this paper, we find that the motion cue of objects along different time frames is critical in 3D multi-object tracking, yet it is less explored in existing monocular-based approaches. We therefore propose MoMA-M3T, a motion-aware framework for monocular 3D MOT that mainly consists of three motion-aware components. First, we represent the possible movement of an object relative to all object tracklets in the feature space as its motion features. Then, we further model the historical object tracklet along the time frame from a spatial-temporal perspective via a motion transformer. Finally, we propose a motion-aware matching module to associate historical object tracklets and current observations as final tracking results. We conduct extensive experiments on the nuScenes and KITTI datasets to demonstrate that our MoMA-M3T achieves competitive performance against state-of-the-art methods. Moreover, the proposed tracker is flexible and can be easily plugged into existing image-based 3D object detectors without re-training. Code and models are available at <a class="link-external link-https" href="https://github.com/kuanchihhuang/MoMA-M3T" rel="external noopener nofollow">this https URL</a>.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to use the motion cues of objects across time frames to improve monocular 3D multi-object tracking (3D MOT). Existing monocular 3D MOT methods struggle with inaccurate and noisy predictions in multi-frame observations. To address this, the authors propose MoMA-M3T, a framework that models the motion relationship between object trajectories and detection results through a motion-aware matching mechanism, supported by motion-aware modules (a motion transformer and a motion-aware matching module) that assist the learning process.

### Main Contributions

1. **Propose the MoMA-M3T Framework**: This framework introduces motion features and a motion-aware matching mechanism for monocular 3D multi-object tracking.
2. **Motion Transformer Module**: This module captures the motion behavior of object trajectories from a spatio-temporal perspective, enabling robust motion feature learning.
3. **Experimental Verification**: Extensive experiments on the nuScenes and KITTI datasets show that the method achieves comparable or better performance than existing monocular 3D MOT methods, and that it can be plugged into various pre-trained 3D detectors.

### Method Overview

1. **Motion Feature Generation**:
   - **Motion Representation**: Represent the motion state of an object by the relative motion vectors between its states in different time frames.
   - **Trajectory-Conditioned Motion Features**: For each detection in the current frame, calculate the relative motion between it and the latest position of every active trajectory, and generate trajectory-conditioned motion features.
   - **Motion Feature Library**: Maintain historical motion features and global 3D positions, which are updated after tracking association in each frame.
2. **Motion Transformer**:
   - **Input and Time Encoding**: Extract the features of the latest T frames of each trajectory from the motion feature library and add learnable temporal positional embeddings to capture temporal cues.
   - **Time Encoder**: Use a transformer model to extract the temporal information of each trajectory and generate motion tokens that represent the object's motion.
   - **Space Encoder**: Introduce the absolute position of each object as global information to capture the spatial dependencies between trajectories.
3. **Motion-Aware Matching Learning**:
   - **Matching Learning**: Calculate the matching scores between detections and trajectories through MLP layers and a sigmoid function, and train with a binary focal loss.
   - **Contrastive Motion Feature Learning**: Use a contrastive learning strategy that encourages motion features from the same trajectory to be similar and those from different trajectories to be dissimilar.

### Experimental Results

- **nuScenes Dataset**: In the monocular setting, MoMA-M3T outperforms existing monocular 3D MOT methods on multiple evaluation metrics, most notably AMOTA.
- **KITTI Dataset**: Compared with other monocular methods, MoMA-M3T also achieves good performance on the 3D MOT task for the car category.

### Conclusion

By introducing the motion-aware matching mechanism and the motion transformer module, MoMA-M3T exploits motion cues in monocular 3D multi-object tracking and improves tracking accuracy and robustness.
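
To make two steps of the Method Overview concrete, the following is a minimal NumPy sketch, not the authors' implementation. It computes the relative motion between each detection and the latest position of each trajectory, turns it into a matching score, and evaluates a binary focal loss on those scores. Everything named here is an assumption for illustration: the function names, the toy coordinates, the distance-based `placeholder_score` that stands in for the learned MLP, the focal-loss defaults `alpha=0.25` and `gamma=2.0`, and the greedy `argmax` association.

```python
import numpy as np


def trajectory_conditioned_motion(det_xyz, track_xyz):
    """Relative motion of each detection w.r.t. each trajectory's latest position.

    det_xyz: (N, 3) current detections; track_xyz: (M, 3) latest trajectory positions.
    Returns an (N, M, 3) array of displacement vectors.
    """
    return det_xyz[:, None, :] - track_xyz[None, :, :]


def placeholder_score(motion, scale=2.0):
    """Hypothetical stand-in for the learned MLP + sigmoid.

    Computes sigmoid(scale - distance), so a small displacement gives a high score.
    """
    dist = np.linalg.norm(motion, axis=-1)
    return 1.0 / (1.0 + np.exp(dist - scale))


def binary_focal_loss(scores, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = np.clip(scores, eps, 1.0 - eps)
    p_t = np.where(targets == 1, p, 1.0 - p)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


# Toy data (assumed): two detections and two active trajectories, in metres.
det_xyz = np.array([[1.0, 0.0, 10.0], [5.0, 0.0, 20.0]])
track_xyz = np.array([[0.8, 0.0, 9.5], [5.2, 0.0, 20.5]])

motion = trajectory_conditioned_motion(det_xyz, track_xyz)  # shape (2, 2, 3)
scores = placeholder_score(motion)                          # shape (2, 2)
assignment = scores.argmax(axis=1)                          # greedy pick per detection

# Ground-truth association for training: detection i belongs to trajectory i.
targets = np.eye(2)
loss = binary_focal_loss(scores, targets)
```

In the framework described above, the score would come from learned motion features rather than a raw distance, and the summary does not specify how scores are converted into final associations, so the greedy selection here is only a placeholder.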