SpatialTracker: Tracking Any 2D Pixels in 3D Space

Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, Xiaowei Zhou
2024-04-06
Abstract: Recovering dense and long-range pixel motion in videos is a challenging problem. Part of the difficulty arises from the 3D-to-2D projection process, leading to occlusions and discontinuities in the 2D motion domain. While 2D motion can be intricate, we posit that the underlying 3D motion can often be simple and low-dimensional. In this work, we propose to estimate point trajectories in 3D space to mitigate the issues caused by image projection. Our method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth estimators, represents the 3D content of each frame efficiently using a triplane representation, and performs iterative updates using a transformer to estimate 3D trajectories. Tracking in 3D allows us to leverage as-rigid-as-possible (ARAP) constraints while simultaneously learning a rigidity embedding that clusters pixels into different rigid parts. Extensive evaluation shows that our approach achieves state-of-the-art tracking performance both qualitatively and quantitatively, particularly in challenging scenarios such as out-of-plane rotation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of recovering dense, long-range pixel motion in videos. Part of the difficulty arises from the 3D-to-2D projection process, which introduces occlusions and discontinuities in the 2D motion domain. While 2D motion can be very complex, the authors argue that the underlying 3D motion is often simple and low-dimensional.

The proposed method, SpatialTracker, mitigates the issues caused by image projection by lifting 2D pixels to 3D and tracking them there. It uses a monocular depth estimator to lift the pixels, represents the 3D content of each frame efficiently with a triplane representation, and iteratively updates the 3D trajectories using a transformer. Tracking in 3D makes it possible to apply as-rigid-as-possible (ARAP) constraints while learning a rigidity embedding that clusters pixels into different rigid parts.

The main contributions of the paper include:

1. **Tracking in 3D space**: Lifting 2D pixels to 3D and using 3D contextual information improves tracking performance, especially under occlusion and complex motion.
2. **Triplane representation**: Triplane feature maps represent the 3D scene of each frame in a compact, regular form that suits the learning framework.
3. **ARAP constraints**: ARAP constraints enforce spatial consistency, which particularly helps when predicting motion through occlusions.
4. **Experimental results**: The method achieves state-of-the-art performance on multiple public tracking benchmarks and does particularly well on complex deformations and frequent self-occlusions.

Overall, the paper aims to overcome the limitations of existing methods in handling complex motion and occlusions by tracking in 3D space, thereby achieving more accurate and robust pixel tracking.
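To make two of the ideas in this summary concrete, here is a minimal NumPy sketch of lifting pixels to 3D with a depth map and of measuring how far a set of 3D trajectories departs from rigid motion. It is an illustration only, not the authors' implementation: the pinhole camera model, the intrinsics `fx, fy, cx, cy`, the function names `lift_to_3d` and `arap_residual`, and the optional `weights` matrix (a stand-in for an affinity derived from a rigidity embedding) are all assumptions introduced here. The paper's actual ARAP loss may be formulated differently.

```python
import numpy as np


def lift_to_3d(pixels, depth, fx, fy, cx, cy):
    """Back-project (u, v) pixels to camera-space 3D points.

    Assumes a pinhole camera; `depth` is an (H, W) map such as the output
    of a monocular depth estimator. `pixels` is an (N, 2) array of (u, v).
    """
    u, v = pixels[:, 0], pixels[:, 1]
    # Nearest-neighbour depth lookup: rows are indexed by v, columns by u.
    z = depth[np.round(v).astype(int), np.round(u).astype(int)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # (N, 3)


def arap_residual(traj, weights=None):
    """Weighted mean change in pairwise distances relative to frame 0.

    traj:    (T, N, 3) array of 3D trajectories.
    weights: optional (N, N) non-negative affinities (hypothetical here);
             pairs believed to lie on the same rigid part get larger weights.
    A value of 0 means every weighted pair keeps its distance over time,
    which is the case for rigid motion.
    """
    T, N, _ = traj.shape
    diffs = traj[:, :, None, :] - traj[:, None, :, :]   # (T, N, N, 3)
    dists = np.linalg.norm(diffs, axis=-1)              # (T, N, N)
    change = np.abs(dists - dists[0])                   # deviation from frame 0
    if weights is None:
        weights = np.ones((N, N))
    return float((change * weights).sum() / (weights.sum() * T))


# Usage: two pixels on a flat depth map, then a rigid and a non-rigid motion.
depth = np.full((4, 4), 2.0)
pixels = np.array([[2.0, 2.0], [3.0, 1.0]])
points = lift_to_3d(pixels, depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)

rigid = np.stack([points, points + np.array([1.0, 0.0, 0.0])])
stretched = np.stack([points, points * np.array([[1.0], [2.0]])])
rigid_score = arap_residual(rigid)
stretched_score = arap_residual(stretched)
```

In this toy setup a pure translation leaves all pairwise distances unchanged, so `rigid_score` is zero, while moving only one of the two points changes their separation and yields a positive `stretched_score`. A penalty of this kind can only be written down once points live in 3D, which is the motivation the summary gives for tracking there rather than in the image plane.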