DELTA: Dense Efficient Long-range 3D Tracking for any video

Tuan Duc Ngo, Peiye Zhuang, Chuang Gan, Evangelos Kalogerakis, Sergey Tulyakov, Hsin-Ying Lee, Chaoyang Wang
2024-11-02
Abstract: Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. We introduce DELTA, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or sparse tracking, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy. Furthermore, we explore the impact of depth representation on tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks. Our method provides a robust solution for applications requiring fine-grained, long-term motion tracking in 3D space.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses dense 3D motion tracking from monocular video, particularly over long sequences that require pixel-level accuracy. It introduces DELTA (Dense Efficient Long-range 3D Tracking for Any Video), a method that efficiently tracks every pixel in 3D space and yields precise motion estimates across the entire video.

### Main Problems and Challenges

1. **Dense 3D Motion Tracking**:
   - Existing methods for dense 3D motion tracking are either computationally inefficient or limited to sparse tracking.
   - Pixel-level tracking over long sequences is particularly challenging because it must handle 3D-to-2D projection, occlusion, camera motion, and dynamic scene changes simultaneously.
2. **Long-term Consistency**:
   - Earlier methods mainly predicted dense motion between adjacent frames or over short sequences, and they often struggle to capture long-term motion.
   - Point tracking methods can establish correspondences over hundreds of frames but are limited to sparse pixels.
3. **Computational Efficiency**:
   - Existing methods incur high computational costs for dense tracking, especially on high-resolution videos.

### Main Contributions of DELTA

1. **Efficient Coarse-to-Fine Strategy**:
   - Uses a space-time attention mechanism to perform coarse tracking at reduced resolution, followed by an attention-based upsampler for high-resolution prediction.
2. **Global and Local Spatial Attention Mechanism**:
   - Introduces an efficient global and local spatial attention architecture that enables end-to-end training while maintaining fine-grained spatial relationships.
3. **Optimization of Depth Representation**:
   - Experiments show that a log-depth representation performs best for 3D tracking, significantly improving tracking accuracy.
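
To make the log-depth representation named in the contributions above more concrete, here is a minimal sketch of converting depth to log-depth and back. It is not the authors' implementation. The function names and the `eps` guard are hypothetical, and the property it demonstrates (equal relative depth changes give equal log-depth offsets at any distance) is only one plausible intuition for why log-depth may help, not a claim the paper makes in these words.

```python
import math


def to_log_depth(depth, eps=1e-6):
    """Map a positive depth value to log-depth; eps (an assumed guard) avoids log(0)."""
    return math.log(max(depth, eps))


def from_log_depth(log_depth):
    """Invert to_log_depth for depths above eps."""
    return math.exp(log_depth)


def log_depth_change(d_before, d_after):
    """Depth change measured in log space, i.e. log(d_after / d_before)."""
    return to_log_depth(d_after) - to_log_depth(d_before)


# The same 10% move away from the camera, for a near point and a far point.
near_change = log_depth_change(1.0, 1.1)    # absolute change: 0.1 units
far_change = log_depth_change(50.0, 55.0)   # absolute change: 5.0 units

print(f"near: {near_change:.4f}, far: {far_change:.4f}")
```

In raw depth the two moves differ by a factor of 50, while in log-depth they are the same offset, so a regression target expressed in log space treats near and far points on a comparable scale.
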
### Experimental Results

- **2D and 3D Dense Tracking Tasks**:
  - Extensive experiments on multiple benchmark datasets, including CVO, Kubric3D, and LSFOdyssey, show DELTA achieving new state-of-the-art results.
  - Compared to existing methods, DELTA runs more than 8 times faster while maintaining high accuracy.
- **3D Point Tracking Tasks**:
  - DELTA also performs well on traditional 3D point tracking benchmarks such as TAP-Vid3D and LSFOdyssey.

In summary, DELTA addresses the computational efficiency and accuracy limitations of existing methods for dense 3D motion tracking, providing a practical solution for applications that require fine-grained, long-term motion tracking in 3D space.