EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

Yuan-Ming Li, Wei-Jin Huang, An-Lan Wang, Ling-An Zeng, Jing-Ke Meng, Wei-Shi Zheng

2024-07-16

Abstract: We present EgoExo-Fitness, a new full-body action understanding dataset, featuring fitness sequence videos recorded from synchronized egocentric and fixed exocentric (third-person) cameras. Compared with existing full-body action understanding datasets, EgoExo-Fitness not only contains videos from first-person perspectives, but also provides rich annotations. Specifically, two-level temporal boundaries are provided to localize single action videos along with sub-steps of each action. More importantly, EgoExo-Fitness introduces innovative annotations for interpretable action judgement, including technical keypoint verification, natural language comments on action execution, and action quality scores. Combining all of these, EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding across dimensions of "what", "when", and "how well". To facilitate research on egocentric and exocentric full-body action understanding, we construct benchmarks on a suite of tasks (i.e., action classification, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task of guidance-based execution verification), together with detailed analysis. Code and data will be available at https://github.com/iSEE-Laboratory/EgoExo-Fitness/tree/main.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address the following key issues:

1. **Developing a new multi-view full-body action understanding dataset**: The authors introduce EgoExo-Fitness, a dataset of fitness sequence videos recorded from synchronized first-person (egocentric) and fixed third-person (exocentric) cameras. It provides rich annotations, including two-level temporal boundaries (single actions and their sub-steps) and interpretable judgements of action execution (technical keypoint verification, natural language comments, and action quality scores). It is also the first full-body action understanding dataset covering both first-person and third-person perspectives.
2. **Advancing research in full-body action understanding**: With EgoExo-Fitness, the authors aim to promote the study of full-body action understanding from both first-person and third-person perspectives, particularly along the dimensions of "what" was done, "when" it was done, and "how well" it was done.
3. **Constructing benchmark tasks**: To facilitate future research, the authors construct benchmarks on a suite of tasks: action classification, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task, guidance-based execution verification. These tasks evaluate a model's ability to understand and assess action execution from different perspectives.
4. **Filling the gaps in existing datasets**: Existing full-body action understanding datasets rely mainly on third-person footage, while existing first-person video datasets focus on desktop activities or daily interactions and pay little attention to full-body actions. EgoExo-Fitness fills this gap and gives researchers a resource for exploring cross-view full-body action understanding.

In summary, the core objective of this paper is to advance full-body action understanding from both first-person and third-person perspectives by introducing the EgoExo-Fitness dataset and a suite of benchmark tasks built on it.
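
To make the annotation types described above more concrete, the following is a minimal sketch of how one annotated action might be represented in Python. The class names, field names, and all example values are hypothetical and chosen only for illustration; they are not the dataset's actual file format, label set, or score scale, which should be taken from the released repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """A labelled time span in seconds (hypothetical layout)."""
    label: str
    start: float
    end: float


@dataclass
class ActionAnnotation:
    """One single-action clip with sub-steps and judgement annotations."""
    action: Segment                                           # first-level boundary: "what" and "when"
    sub_steps: List[Segment] = field(default_factory=list)    # second-level boundaries
    keypoints_ok: Dict[str, bool] = field(default_factory=dict)  # technical keypoint -> verified?
    comment: str = ""                                         # natural language comment
    quality_score: float = 0.0                                # "how well" (scale is an assumption)

    def sub_steps_are_nested(self) -> bool:
        """Check that every sub-step lies inside the action's boundary."""
        return all(
            self.action.start <= s.start <= s.end <= self.action.end
            for s in self.sub_steps
        )


# Made-up values, for illustration only.
example = ActionAnnotation(
    action=Segment("push-up", 12.0, 20.5),
    sub_steps=[Segment("lower body", 12.0, 16.0), Segment("push up", 16.0, 20.5)],
    keypoints_ok={"keep back straight": True, "elbows close to body": False},
    comment="Back is straight, but the elbows flare outward.",
    quality_score=3.5,
)
print(example.sub_steps_are_nested())
```

Grouping the sub-steps under their parent action mirrors the two-level temporal boundaries mentioned in the abstract, and the nesting check is one simple sanity test such a structure allows.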

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Egok360: A 360 Egocentric Kinetic Human Activity Video Dataset

E3V-K5: an Authentic Benchmark for Redefining Video-Based Energy Expenditure Estimation

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

Ego-Deliver: A Large-Scale Dataset for Egocentric Video Analysis.

FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

Ego-Body Pose Estimation via Ego-Head Pose Estimation

Ego4D: Around the World in 3,000 Hours of Egocentric Video

EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition.

Action Scene Graphs for Long-Form Understanding of Egocentric Videos

EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding

EgoHumans: An Egocentric 3D Multi-Human Benchmark

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

Egocentric Action Recognition by Automatic Relation Modeling.

EV-Action: Electromyography-Vision Multi-Modal Action Dataset

Desktop Action Recognition from First-Person Point-of-View

RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

4D Human Body Capture from Egocentric Video via 3D Scene Grounding