Long-Term 3D Point Tracking By Cost Volume Fusion

Hung Nguyen, Chanho Kim, Rigved Naukarkar, Li Fuxin
2024-07-18
Abstract: Long-term point tracking is essential to understand non-rigid motion in the physical world better. Deep learning approaches have recently been incorporated into long-term point tracking, but most prior work predominantly functions in 2D. Although these methods benefit from the well-established backbones and matching frameworks, the motions they produce do not always make sense in the 3D physical world. In this paper, we propose the first deep learning framework for long-term point tracking in 3D that generalizes to new points and videos without requiring test-time fine-tuning. Our model contains a cost volume fusion module that effectively integrates multiple past appearances and motion information via a transformer architecture, significantly enhancing overall tracking performance. In terms of 3D tracking performance, our model significantly outperforms simple scene flow chaining and previous 2D point tracking methods, even if one uses ground truth depth and camera pose to backproject 2D point tracks in a synthetic scenario.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses long-term 3D point tracking: how to track arbitrary points over an extended period in dynamic 3D scenes without fine-tuning at test time. Most existing methods concentrate on 2D point tracking. Although these methods perform well in the image plane, the motion they estimate is often inaccurate in the 3D physical world. Existing 3D point tracking methods, in turn, usually require extensive test-time optimization, which limits their practicality in real-time applications.

### Main Contributions

1. **First online 3D point tracking framework**: The framework tracks arbitrary points in 3D without requiring optimization during testing.
2. **Cost Volume Fusion Module**: This module combines long-term appearance information with the past motion trajectory of each point, significantly improving tracking performance.
3. **Adaptive Decoding Module**: This module decodes only the points around the query point, which significantly reduces memory consumption, especially for dense point clouds, and lets the model generate more accurate motion predictions.

### Method Overview

1. **Data Input and Preprocessing**: Assuming camera poses and depth are already available (e.g., from a SLAM system), convert the video into a sequence of point clouds.
2. **Feature Extraction**: Use a U-Net-based backbone network to extract multi-level features of the point cloud.
3. **Cost Volume Construction**: For each query point, construct cost volumes of appearance and motion information over multiple time steps.
4. **Cost Volume Fusion**: Predict the actual motion and occlusion state of each point by combining motion priors and appearance matching information through a novel data-driven cost volume fusion module.
5. **Adaptive Decoding**: Decode only the points around the query point to reduce computation and memory consumption.
6. **Model Training**: Training has two stages: scene flow pre-training, followed by long-term tracking training.

### Experimental Results

- **Scene Flow Pre-training**: Experimental results on the FlyingThings dataset show that this framework significantly outperforms existing 2D methods in the scene flow task.
- **3D Point Tracking**: Experimental results on the TapVid-Kubric and PointOdyssey datasets show that this method significantly outperforms existing 2D methods and simple scene flow chaining in the 3D point tracking task.
- **Occlusion Accuracy**: In occluded areas, this method's accuracy is significantly higher than that of other methods, especially when 3D motion priors are insufficient.

### Conclusion

This paper proposes a new online 3D point tracking framework capable of long-term tracking of arbitrary points in dynamic 3D scenes without requiring optimization during testing. By introducing the Cost Volume Fusion Module and the Adaptive Decoding Module, the method achieves significant performance improvements in 3D point tracking, particularly in occluded areas. These improvements make the method practical for applications such as augmented reality and robotic manipulation.
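
### Illustrative Sketch: Depth to Point Cloud

The preprocessing step in the method overview (turning video frames with known depth and camera pose into point clouds) can be illustrated with a small back-projection sketch. This is not the authors' code. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)` and a camera-to-world pose given as a rotation matrix and translation vector; all function and parameter names here are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes a pinhole model: u = fx * x / z + cx, v = fy * y / z + cy.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p: Point,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Point:
    """Apply an assumed camera-to-world pose: p_world = R @ p_cam + t."""
    x, y, z = (
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def depth_to_point_cloud(depth_map: Sequence[Sequence[float]],
                         intrinsics: Tuple[float, float, float, float],
                         rotation: Sequence[Sequence[float]],
                         translation: Sequence[float]) -> List[Point]:
    """Convert one depth map (rows of depths) into a world-frame point cloud."""
    fx, fy, cx, cy = intrinsics
    cloud: List[Point] = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d <= 0:  # treat non-positive depth as "no measurement"
                continue
            p_cam = backproject_pixel(u, v, d, fx, fy, cx, cy)
            cloud.append(camera_to_world(p_cam, rotation, translation))
    return cloud
```

Applying `depth_to_point_cloud` to every frame gives the point cloud sequence on which feature extraction and cost volume construction would then operate. A practical pipeline would vectorize this loop with an array library. The convention used here (camera-to-world pose, non-positive depth meaning invalid) is an assumption and may differ from what a given SLAM system outputs.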