EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

Yuan-Ming Li, Wei-Jin Huang, An-Lan Wang, Ling-An Zeng, Jing-Ke Meng, Wei-Shi Zheng
2024-07-16
Abstract: We present EgoExo-Fitness, a new full-body action understanding dataset, featuring fitness sequence videos recorded from synchronized egocentric and fixed exocentric (third-person) cameras. Compared with existing full-body action understanding datasets, EgoExo-Fitness not only contains videos from first-person perspectives, but also provides rich annotations. Specifically, two-level temporal boundaries are provided to localize single action videos along with sub-steps of each action. More importantly, EgoExo-Fitness introduces innovative annotations for interpretable action judgement, including technical keypoint verification, natural language comments on action execution, and action quality scores. Combining all of these, EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding across dimensions of "what", "when", and "how well". To facilitate research on egocentric and exocentric full-body action understanding, we construct benchmarks on a suite of tasks (i.e., action classification, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task of guidance-based execution verification), together with detailed analysis. Code and data will be available at https://github.com/iSEE-Laboratory/EgoExo-Fitness/tree/main.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to address the following key issues:

1. **Developing a new multi-view full-body action understanding dataset**: The authors introduce EgoExo-Fitness, a dataset of fitness sequence videos recorded from synchronized first-person (egocentric) and fixed third-person (exocentric) cameras. It provides rich annotations: two-level temporal boundaries that localize each action and its sub-steps, and interpretable judgements of execution quality (technical keypoint verification, natural language comments, and action quality scores). Unlike existing full-body action understanding datasets, it also includes videos from first-person perspectives.
2. **Advancing research in full-body action understanding**: With EgoExo-Fitness, the authors aim to promote the study of full-body action understanding from both first-person and third-person perspectives along three dimensions: "what" was done, "when" it was done, and "how well" it was done.
3. **Constructing benchmark tasks**: To facilitate future research, the authors construct benchmarks on a suite of tasks: action classification, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task, guidance-based execution verification. These tasks evaluate how well models understand and assess action execution from different viewpoints.
4. **Filling the gaps in existing datasets**: Existing full-body action understanding datasets rely mainly on third-person cameras, while existing first-person video datasets focus on tabletop activities or daily interactions and pay little attention to full-body actions. EgoExo-Fitness fills this gap and gives researchers a resource for studying cross-view full-body action understanding.
In summary, the core objective of this paper is to advance research on full-body action understanding from both first-person and third-person perspectives by introducing the EgoExo-Fitness dataset and a set of benchmark tasks built on it.