Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei Huang, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
2024-09-26
Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the limitations of existing datasets for understanding and modeling skilled human activity. Specifically, current datasets have the following main problems:

1. **Single perspective**: Most existing datasets cover only the first-person (egocentric) or the third-person (exocentric) perspective and do not capture both simultaneously. Learning and understanding human skills, however, usually requires observation from multiple perspectives to obtain more comprehensive information.
2. **Insufficient scale and diversity**: Existing multi-perspective datasets are small, with limited scenes and activity types, and they struggle to reflect the diversity and complexity of the real world. For example, many datasets are limited to laboratory environments or specific daily activities and cannot cover a wide range of practical application scenarios.
3. **Single modality**: Existing datasets usually contain only video and lack other important modalities such as audio, eye tracking, and 3D point clouds. These modalities are crucial for an in-depth understanding of skilled human activity.
4. **Inadequate annotation**: Existing datasets are deficient in annotation; in particular, they lack high-quality annotations for fine-grained activity recognition and skill-level assessment.

To overcome these problems, the paper introduces **Ego-Exo4D**, a large-scale, multimodal, multi-perspective video dataset intended to support in-depth research on skilled human activity. The main features of Ego-Exo4D include:

- **Large scale and diversity**: It includes 740 participants from 13 cities around the world and 123 different natural scenes, with a total video duration of 1,286 hours.
- **Multi-perspective**: Each sequence captures first-person and third-person video simultaneously, and all views are time-synchronized and precisely localized.
- **Multimodal**: In addition to video, it includes multichannel audio, eye gaze, 3D point clouds, camera poses, and IMU data.
- **High-quality annotation**: It provides annotations for a suite of benchmark tasks, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand and body pose.

Through these features, Ego-Exo4D aims to advance first-person video understanding, skill learning, multimodal perception, and related fields, and to provide a powerful resource for researchers.