Abstract: The growing interest in embodied intelligence has brought ego-centric perspectives to contemporary research. One significant challenge within this realm is the accurate localization and tracking of objects in ego-centric videos, primarily due to the substantial variability in viewing angles. Addressing this issue, this paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects from the ego-centric video. We present Ego3DT, a novel framework that first identifies and extracts detection and segmentation information of objects within the ego environment. Utilizing information from adjacent video frames, Ego3DT dynamically constructs a 3D scene of the ego view using a pre-trained 3D scene reconstruction model. Additionally, we introduce a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos. The efficacy of our approach is corroborated by extensive experiments on two newly compiled datasets, with improvements of 1.04x to 2.90x in HOTA, showcasing the robustness and accuracy of our method in diverse ego-centric scenarios.

What problem does this paper attempt to address?

The paper addresses the problem of accurately localizing and tracking three-dimensional objects in ego-centric videos. The main challenges are:

1. **Large changes in perspective**: The camera moves substantially in ego-centric videos, so the viewing angle of an object changes significantly from frame to frame.
2. **Objects frequently enter and exit the field of view**: Head or hand movements cause objects to enter and leave the camera's field of view repeatedly, and objects may undergo frequent occlusion as well as changes in scale, pose, and even appearance.
3. **Poor performance of existing methods**: Traditional multi-object tracking (MOT) methods perform poorly on ego-centric videos, mainly because datasets and evaluation criteria designed specifically for such videos are lacking.

To address these problems, the paper proposes Ego3DT, a new zero-shot method for reconstructing and tracking all three-dimensional objects in ego-centric videos. Its main contributions are:

- **Constructing 3D scenes**: A pre-trained 3D scene reconstruction model dynamically constructs a 3D scene of the ego-centric view.
- **Open-vocabulary object tracking**: Ego3DT takes only RGB video as input and achieves open-vocabulary object tracking without additional annotations.
- **Dynamic matching mechanism**: A cross-window matching method matches objects by their 3D positions, avoiding the instability caused by relying solely on 2D image tracking.
- **Innovative hierarchical association mechanism**: A hierarchical association step creates stable 3D tracking trajectories, keeping objects continuous and consistent across frames.

These improvements enable Ego3DT to perform significantly better than existing methods on two newly compiled datasets, with a 1.04- to 2.90-fold improvement in the HOTA metric, demonstrating its robustness and accuracy in diverse ego-centric scenarios.

### Formula summary

1. **3D coordinate transformation**:
   \[ O_{3D} = G(X, O_{Seg}^{2D}) \]
   where \(O_{3D}\) is the 3D coordinate matrix, \(G\) is the 3D estimation model, \(X\) is the video frame, and \(O_{Seg}^{2D}\) is the 2D segmentation result.
2. **3D point matching**:
   \[ Y = M(O_{3D}) = \operatorname{PointMatch}(A(O_{3D})) \]
   where \(M\) is the matching module and \(A\) is the 3D scene registration method.
3. **Sliding window mechanism**:
   \[ S = W - T \]
   where \(W\) is the window size, \(T\) is the overlap size, and \(S\) is the step distance.
4. **Optimizing the homography transformation**:
   \[ H_{t}^* = \arg\min_{H_{t}} \sum_{i = 1}^{A} \| O_{t - 1, i}^{3D} - H_{t} O_{t, i}^{3D} \|^2 \]
   where \(H_{t}\) is the homography matrix, \(O_{t}^{3D}\) denotes the 3D coordinates \(O_{3D}\) at frame \(t\), \(i\) indexes the matched point pairs, and \(A\) here is the total number of matching points (not the registration method of item 2).

Together, these components let Ego3DT track objects in ego-centric videos with more stable and accurate results.
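
The sliding-window step and the transformation fit from the formula summary can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `window_starts` and `fit_alignment` are hypothetical, and the solver is an assumption (a plain unconstrained least-squares fit on homogeneous coordinates), chosen only because it minimizes the same squared-error objective over matched 3D points.

```python
import numpy as np


def window_starts(num_frames: int, window: int, overlap: int) -> list[int]:
    """Start indices of sliding windows with step S = W - T (hypothetical helper)."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= T < W")
    stride = window - overlap  # S = W - T
    last_start = max(num_frames - window, 0)
    starts = list(range(0, last_start + 1, stride))
    # Add a final window so that trailing frames are still covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return starts


def fit_alignment(prev_pts: np.ndarray, curr_pts: np.ndarray) -> np.ndarray:
    """Fit a 4x4 matrix H minimizing sum_i ||prev_i - H curr_i||^2.

    prev_pts and curr_pts are (A, 3) arrays of matched 3D points.
    Assumption: an unconstrained least-squares fit on homogeneous
    coordinates; the paper's actual solver may differ.
    """
    ones = np.ones((curr_pts.shape[0], 1))
    curr_h = np.hstack([curr_pts, ones])  # (A, 4) homogeneous points, frame t
    prev_h = np.hstack([prev_pts, ones])  # (A, 4) homogeneous points, frame t-1
    # Solve curr_h @ X ~= prev_h in the least-squares sense, so H = X.T.
    x, *_ = np.linalg.lstsq(curr_h, prev_h, rcond=None)
    return x.T


# Usage: windows of 4 frames overlapping by 2 over a 10-frame clip.
starts = window_starts(num_frames=10, window=4, overlap=2)
```

With at least four matched points in general position the fit is unique; with fewer, `np.linalg.lstsq` returns the minimum-norm solution, so a real pipeline would likely need an additional check on the number and spread of matches.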

Ego3DT: Tracking Every 3D Object in Ego-centric Videos

3D-Aware Instance Segmentation and Tracking in Egocentric Videos

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

Instance Tracking in 3D Scenes from Egocentric Videos

EgoLifter: Open-world 3D Segmentation for Egocentric Perception

EgoTracks: A Long-term Egocentric Visual Object Tracking Dataset

EgoHumans: An Egocentric 3D Multi-Human Benchmark

EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views

EgoEnv: Human-centric environment representations from egocentric video

Spatial Cognition from Egocentric Video: Out of Sight, Not Out of Mind

Ego-Body Pose Estimation via Ego-Head Pose Estimation

3D Human Pose Perception from Egocentric Stereo Videos

Ego+X: an Egocentric Vision System for Global 3D Human Pose Estimation and Social Interaction Characterization

Hexamethyldisiloxane: A 13-week subchronic whole-body vapor inhalation toxicity study in Fischer 344 rats.

Egocentric Audio-Visual Object Localization

EgoLocate: Real-time Motion Capture, Localization, and Mapping with Sparse Body-mounted Sensors

Scene-aware Egocentric 3D Human Pose Estimation