Abstract: The popularity of wearable devices has increased the demand for research on first-person activity recognition. However, most current first-person activity datasets are built on the assumption that only the human-object interaction (HOI) activities performed by the camera-wearer are captured in the field of view. Because humans live in complicated scenarios, third-person activities performed by other people are likely to appear alongside the first-person activities. Analyzing and recognizing these two types of activities when they occur simultaneously in a scene is important for the camera-wearer to understand the surrounding environment. To facilitate research on concurrent first- and third-person activity recognition (CFT-AR), we first created a new activity dataset, PolyU concurrent first- and third-person (CFT) Daily, which exhibits distinct properties and challenges compared with previous activity datasets. Since temporal asynchronism and an appearance gap usually exist between the first- and third-person activities, it is crucial to learn robust representations from all the activity-related spatio-temporal positions. Thus, we explore both holistic scene-level and local instance-level (person-level) features to provide comprehensive and discriminative patterns for recognizing both first- and third-person activities. On the one hand, the holistic scene-level features are extracted by a 3-D convolutional neural network, which is trained to mine shared and sample-unique semantics between video pairs via two attention-based modules and a self-knowledge distillation (SKD) strategy. On the other hand, we leverage the extracted holistic features to guide the learning of instance-level features in a disentangled fashion, which aims to discover both spatially conspicuous patterns and temporally varied, yet critical, cues. Experimental results on the PolyU CFT Daily dataset validate that our method achieves state-of-the-art performance.

Global and Local C3D Ensemble System for First Person Interactive Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

Deep Attention Network for Egocentric Action Recognition

Online Robust Action Recognition Based on a Hierarchical Model

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Human Action Recognition Using Deep Learning Methods

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Global Context-Aware Attention LSTM Networks for 3D Action Recognition

Action Recognition in RGB-D Egocentric Videos

Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation

Action Recognition and Localization with Instance FCNN

Learning Comprehensive Motion Representation for Action Recognition

Multi-Stream Deep Neural Networks for RGB-D Egocentric Action Recognition

Multi-cue based four-stream 3D ResNets for video-based action recognition

A Novel 3D Human Action Recognition Framework for Video Content Analysis

Generic Enhanced Ensemble Learning with Multi-Level Kinematic Constraints for 3D Action Recognition

Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization

Holistic-Guided Disentangled Learning with Cross-Video Semantics Mining for Concurrent First-Person and Third-Person Activity Recognition

Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition

Attention-Based Multilevel Co-Occurrence Graph Convolutional LSTM for 3-D Action Recognition

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition