Ego3DT: Tracking Every 3D Object in Ego-centric Videos

Shengyu Hao, Wenhao Chai, Zhonghan Zhao, Meiqi Sun, Wendi Hu, Jieyang Zhou, Yixian Zhao, Qi Li, Yizhou Wang, Xi Li, Gaoang Wang
2024-10-11
Abstract: The growing interest in embodied intelligence has brought ego-centric perspectives to contemporary research. One significant challenge within this realm is the accurate localization and tracking of objects in ego-centric videos, primarily due to the substantial variability in viewing angles. Addressing this issue, this paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects from the ego-centric video. We present Ego3DT, a novel framework that initially identifies and extracts detection and segmentation information of objects within the ego environment. Utilizing information from adjacent video frames, Ego3DT dynamically constructs a 3D scene of the ego view using a pre-trained 3D scene reconstruction model. Additionally, we have innovated a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos. Moreover, the efficacy of our approach is corroborated by extensive experiments on two newly compiled datasets, with 1.04x - 2.90x in HOTA, showcasing the robustness and accuracy of our method in diverse ego-centric scenarios.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses the accurate localization and tracking of three-dimensional objects in ego-centric videos. The main challenges are:

1. **Large changes in perspective**: The camera moves substantially in ego-centric videos, so the viewpoint on an object changes significantly from frame to frame.
2. **Objects frequently enter and exit the field of view**: Head or hand movements cause objects to enter and leave the camera's field of view often, and objects also undergo frequent occlusion and changes in scale, pose, and even appearance.
3. **Poor performance of existing methods**: Traditional multi-object tracking (MOT) methods perform poorly on ego-centric videos, mainly because datasets and evaluation criteria designed for such videos are lacking.

To address these problems, the paper proposes Ego3DT, a new zero-shot method for reconstructing and tracking all three-dimensional objects in ego-centric videos. Its main contributions are:

- **Constructing 3D scenes**: A pre-trained 3D scene reconstruction model dynamically builds a 3D scene of the ego-centric view.
- **Open-vocabulary object tracking**: The method takes only RGB video as input and tracks open-vocabulary objects without additional annotations.
- **Dynamic matching mechanism**: Cross-window matching on 3D positions avoids the instability caused by relying solely on 2D image tracking.
- **Innovative hierarchical association mechanism**: Stable 3D tracking trajectories keep objects continuous and consistent across frames.
These improvements enable Ego3DT to significantly outperform existing methods on two newly compiled datasets, with a 1.04- to 2.90-fold improvement in the HOTA metric, demonstrating its robustness and accuracy in diverse ego-centric scenarios.

### Formula summary

1. **3D coordinate transformation**:
   \[ O_{3D} = G(X, O_{Seg}^{2D}) \]
   where \(O_{3D}\) is the 3D coordinate matrix, \(G\) is the 3D estimation model, \(X\) is the video frame, and \(O_{Seg}^{2D}\) is the 2D segmentation result.
2. **3D point matching**:
   \[ Y = M(O_{3D}) = PointMatch(A(O_{3D})) \]
   where \(M\) is the matching module and \(A\) is the 3D scene registration method.
3. **Sliding window mechanism**:
   \[ S = W - T \]
   where \(W\) is the window size, \(T\) is the overlap size, and \(S\) is the step distance.
4. **Optimizing homeomorphic transformation**:
   \[ H_{t}^* = \arg\min_{H_{t}} \sum_{i = 1}^{A} \|O_{t - 1, i}^{3D} - H_{t} O_{t, i}^{3D}\|^2 \]
   where \(H_{t}\) is the homeomorphic matrix for frame \(t\), \(i\) indexes the matched point pairs, and \(A\) here is the total number of matching points (not the registration method \(A\) of formula 2).

Together, these components let Ego3DT track objects in ego-centric videos and produce more stable and accurate tracking results.
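The sliding-window relation \(S = W - T\) from the formula summary can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `sliding_windows`, the zero-based frame indexing, and the choice to truncate the final window at the end of the video are all assumptions made for illustration.

```python
def sliding_windows(num_frames, window, overlap):
    """Split frame indices into overlapping windows.

    Hypothetical helper (not from the paper): returns a list of
    (start, end) pairs with `end` exclusive. Consecutive windows
    start `window - overlap` frames apart, i.e. S = W - T.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap  # S = W - T
    windows = []
    start = 0
    while start < num_frames:
        # Assumption: the last window is truncated at the video end.
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += step
    return windows


if __name__ == "__main__":
    # W = 4, T = 2, so S = 2: each window shares 2 frames with the next.
    print(sliding_windows(10, 4, 2))  # → [(0, 4), (2, 6), (4, 8), (6, 10)]
```

With \(W = 4\) and \(T = 2\), each window shares its last two frames with the next one. Shared frames of this kind are what would let matches from one window be carried into the following window, although how Ego3DT uses them is not detailed in this summary.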