Abstract: Egocentric videos, which record the daily activities of individuals from a first-person point of view, have attracted increasing attention in recent years because of their growing use in many popular applications, including life logging, health monitoring, and virtual reality. As a fundamental problem in egocentric vision, egocentric action recognition aims to recognize the actions of camera wearers from egocentric videos. Relation modeling is important for this task, because the interactions between the camera wearer and the recorded persons or objects form complex relations in egocentric videos. However, only a few existing methods model the relations between the camera wearer and the interacting persons, and they require prior knowledge or auxiliary data to localize the interacting persons. In this work, we model the relations in a weakly supervised manner, i.e., without using annotations or prior knowledge about the interacting persons or objects. We form a weakly supervised framework that unifies automatic interactor localization and explicit relation modeling. First, we learn to automatically localize the interactors, i.e., the body parts of the camera wearer and the persons or objects that the camera wearer interacts with, by learning a series of keypoints directly from video data; these keypoints localize the action-relevant regions using only action labels and some constraints on the keypoints. Second, and more importantly, to explicitly model the relations between the interactors, we develop an ego-relational LSTM (long short-term memory) network with several candidate connections to model the complex relations in egocentric videos, such as temporal, interactive, and contextual relations. In particular, to reduce the human effort and manual intervention needed to construct an optimal ego-relational LSTM structure, we search for the optimal connections with a differentiable network architecture search mechanism, which automatically constructs the ego-relational LSTM network to explicitly model different relations for egocentric action recognition. We conduct extensive experiments on egocentric video datasets to demonstrate the effectiveness of our method.
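The idea of searching over candidate connections with a differentiable mechanism can be illustrated with a generic softmax relaxation, in the spirit of common differentiable architecture search methods. The sketch below is not the authors' implementation: the function names, the three candidate connection labels, the vector size, and the values of `alpha` are all hypothetical, chosen only to show how a discrete choice among connections becomes a weighted mixture during search and a single selected connection afterwards.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mixed_connection(candidates, alpha):
    """Continuous relaxation: blend candidate inputs with softmax(alpha) weights."""
    weights = softmax(alpha)
    return sum(w * c for w, c in zip(weights, candidates))

def discretize(alpha, names):
    """After search, keep only the candidate with the largest architecture weight."""
    return names[int(np.argmax(alpha))]

# Hypothetical candidate connections feeding one recurrent unit (illustrative only).
names = ["temporal", "interactive", "contextual"]
rng = np.random.default_rng(0)
hidden = {n: rng.standard_normal(4) for n in names}  # stand-in hidden states

# Assumed architecture parameters; in a real search these would be learned
# by gradient descent jointly with the network weights.
alpha = np.array([0.1, 1.5, -0.3])

mixed = mixed_connection([hidden[n] for n in names], alpha)
chosen = discretize(alpha, names)
```

Because the mixture is a smooth function of `alpha`, gradients of the recognition loss can flow to the architecture parameters, which is what makes the search differentiable; the final discrete structure is read off with an argmax.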


Egocentric Action Recognition by Automatic Relation Modeling.
