Abstract: Long-term point tracking is essential to understand non-rigid motion in the physical world better. Deep learning approaches have recently been incorporated into long-term point tracking, but most prior work predominantly functions in 2D. Although these methods benefit from the well-established backbones and matching frameworks, the motions they produce do not always make sense in the 3D physical world. In this paper, we propose the first deep learning framework for long-term point tracking in 3D that generalizes to new points and videos without requiring test-time fine-tuning. Our model contains a cost volume fusion module that effectively integrates multiple past appearances and motion information via a transformer architecture, significantly enhancing overall tracking performance. In terms of 3D tracking performance, our model significantly outperforms simple scene flow chaining and previous 2D point tracking methods, even if one uses ground truth depth and camera pose to backproject 2D point tracks in a synthetic scenario.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses long-term 3D point tracking. Specifically, it focuses on how to track arbitrary points over an extended period in dynamic 3D scenes without requiring fine-tuning at test time. Most existing methods concentrate on 2D point tracking. Although these methods perform well in 2D images, their motion estimates are often inaccurate in the 3D physical world. Additionally, existing 3D point tracking methods usually require extensive test-time optimization, which limits their practicality in real-time applications.

### Main Contributions

1. **First online 3D point tracking framework**: The framework can track arbitrary points in 3D without requiring optimization at test time.
2. **Cost Volume Fusion Module**: This module combines long-term appearance information with the past motion trajectory of each point, significantly improving tracking performance.
3. **Adaptive Decoding Module**: This module selectively decodes only the points around the query point, significantly reducing memory consumption, especially on dense point clouds, and enabling the model to generate more accurate motion predictions.

### Method Overview

1. **Data Input and Preprocessing**: Assuming camera poses and depth are already available (e.g., from a SLAM system), convert the video into a sequence of point clouds.
2. **Feature Extraction**: Use a U-Net-based backbone network to extract multi-level features from each point cloud.
3. **Cost Volume Construction**: For each query point, construct cost volumes of appearance and motion information over multiple time steps.
4. **Cost Volume Fusion**: Predict the actual motion and occlusion state of each point by combining motion priors and appearance matching information through a novel data-driven cost volume fusion module.
5. **Adaptive Decoding**: Selectively decode points around the query point to reduce computation and memory consumption.
6. **Model Training**: Training consists of two stages: scene flow pre-training, followed by long-term tracking training.

### Experimental Results

- **Scene Flow Pre-training**: Experimental results on the FlyingThings dataset show that this framework significantly outperforms existing 2D methods in the scene flow task.
- **3D Point Tracking**: Experimental results on the TapVid-Kubric and PointOdyssey datasets show that this method significantly outperforms existing 2D methods and simple scene flow chaining in the 3D point tracking task.
- **Occlusion Accuracy**: In occluded areas, this method's accuracy is significantly higher than that of other methods, especially when 3D motion priors are insufficient.

### Conclusion

This paper proposes a new online 3D point tracking framework capable of long-term tracking of arbitrary points in dynamic 3D scenes without requiring optimization at test time. By introducing the Cost Volume Fusion Module and the Adaptive Decoding Module, this method achieves significant performance improvements in 3D point tracking tasks, particularly in occluded areas. These improvements make the method practical for applications such as augmented reality and robotic manipulation.
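### Illustrative Sketch

To make steps 1 and 3 of the method overview more concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a pinhole intrinsics matrix `K` and a 4x4 camera-to-world pose, and it uses plain cosine similarity as a stand-in for the appearance cost; the paper instead uses learned U-Net features and a transformer-based fusion module. The function names `backproject` and `cost_volume` are hypothetical.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (H*W, 3) world-space point cloud.

    Assumes a pinhole camera with intrinsics K (3x3) and a 4x4
    camera-to-world pose; both are assumptions for this sketch.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T        # camera-space rays with z = 1
    cam_pts = rays * depth.reshape(-1, 1)  # scale each ray by its depth
    ones = np.ones((cam_pts.shape[0], 1))
    cam_h = np.concatenate([cam_pts, ones], axis=1)  # homogeneous coords
    return (cam_h @ cam_to_world.T)[:, :3]

def cost_volume(query_feat, cand_feats):
    """Cosine similarity between one query feature (C,) and N candidate
    features (N, C). A simple stand-in for a learned appearance cost."""
    q = query_feat / np.linalg.norm(query_feat)
    c = cand_feats / np.linalg.norm(cand_feats, axis=1, keepdims=True)
    return c @ q

# Toy usage: a 2x2 depth map at constant depth 2, camera shifted by +1 in x.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[0, 3] = 1.0
points = backproject(np.full((2, 2), 2.0), K, pose)

# Toy features: the first candidate matches the query direction exactly.
scores = cost_volume(np.array([1.0, 0.0]),
                     np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]))
best = int(np.argmax(scores))
```

In a tracker built along these lines, one such score vector would be computed per query point and per time step, and the stack of scores across past frames is what a fusion module would consume together with motion information.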

Long-Term 3D Point Tracking By Cost Volume Fusion
