Abstract: Egocentric videos, which record the daily activities of individuals from a first-person point of view, have attracted increasing attention in recent years because of their growing use in many popular applications, including life logging, health monitoring, and virtual reality. Egocentric action recognition, a fundamental problem in egocentric vision, aims to recognize the actions of camera wearers from egocentric videos. Relation modeling is important for this task because the interactions between the camera wearer and the recorded persons or objects form complex relations in egocentric videos. However, only a few existing methods model the relations between the camera wearer and the interacting persons for egocentric action recognition, and they require prior knowledge or auxiliary data to localize the interacting persons. In this work, we model the relations in a weakly supervised manner, i.e., without using annotations or prior knowledge about the interacting persons or objects, for egocentric action recognition. We form a weakly supervised framework that unifies automatic interactor localization and explicit relation modeling, so that relations are modeled automatically. First, we learn to automatically localize the interactors, i.e., the body parts of the camera wearer and the persons or objects that the camera wearer interacts with, by learning a series of keypoints directly from video data; these keypoints localize the action-relevant regions using only action labels and some constraints on the keypoints. Second, and more importantly, to explicitly model the relations between the interactors, we develop an ego-relational LSTM (long short-term memory) network with several candidate connections to model the complex relations in egocentric videos, such as temporal, interactive, and contextual relations.
In particular, to reduce the human effort and manual intervention needed to construct an optimal ego-relational LSTM structure, we search for the optimal connections with a differentiable network architecture search mechanism, which automatically constructs the ego-relational LSTM network to explicitly model different relations for egocentric action recognition. We conduct extensive experiments on egocentric video datasets to demonstrate the effectiveness of our method.
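
The idea of searching over candidate connections in a differentiable way can be illustrated with a minimal sketch. The abstract does not specify the candidate set, the search space, or the training procedure, so everything below is an assumption for illustration: the four candidate names, the `mixed_connection` and `select_connection` helpers, and the use of a softmax-weighted mixture (the continuous relaxation commonly used in differentiable architecture search) are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate connections feeding one LSTM unit. Each one picks a
# different source of information (names are illustrative, not the paper's).
CANDIDATES = {
    "temporal": lambda own_prev, other_prev, context: own_prev,
    "interactive": lambda own_prev, other_prev, context: other_prev,
    "contextual": lambda own_prev, other_prev, context: context,
    "none": lambda own_prev, other_prev, context: [0.0] * len(own_prev),
}


def mixed_connection(alphas, own_prev, other_prev, context):
    """Softmax-weighted sum of all candidate connections.

    `alphas` maps each candidate name to a real-valued architecture
    parameter. Because the mixture is smooth in `alphas`, these parameters
    could be trained by gradient descent alongside the network weights.
    """
    names = list(CANDIDATES)
    weights = softmax([alphas[n] for n in names])
    out = [0.0] * len(own_prev)
    for w, n in zip(weights, names):
        cand = CANDIDATES[n](own_prev, other_prev, context)
        out = [o + w * c for o, c in zip(out, cand)]
    return out


def select_connection(alphas):
    """After the search, keep only the candidate with the largest parameter."""
    return max(alphas, key=alphas.get)
```

With equal parameters every candidate contributes equally to the mixture; once the parameters have been trained, `select_connection` discretizes the structure by keeping the strongest connection.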

Trear: Transformer-Based RGB-D Egocentric Action Recognition

Action Recognition in RGB-D Egocentric Videos

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Multi-Stream Deep Neural Networks for RGB-D Egocentric Action Recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Egocentric RGB+Depth Action Recognition in Industry-Like Settings

Deep Attention Network for Egocentric Action Recognition

3D Action Recognition Using Multi-scale Energy-based Global Ternary Image

EventTransAct: A video transformer-based framework for Event-camera based action recognition

A Novel Two-Stream Transformer-Based Framework for Multi-Modality Human Action Recognition

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

CAST: Cross-Attention in Space and Time for Video Action Recognition

ARCTIC: A knowledge distillation approach via attention-based relation matching and activation region constraint for RGB-to-Infrared videos action recognition

Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition

Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective

Enhanced Attention Tracking with Multi-Branch Network for Egocentric Activity Recognition

Action Recognition and Benchmark Using Event Cameras

Egocentric Action Recognition by Automatic Relation Modeling

Temporal-Relational CrossTransformers for Few-Shot Action Recognition

RGB-D Based Action Recognition with Light-weight 3D Convolutional Networks