Abstract: Real-time human action classification in complex scenes has applications in various domains such as visual surveillance, video retrieval, and human-robot interaction. The task is nonetheless challenging because of computational efficiency requirements, cluttered backgrounds, and intra-class variability among actions of the same type. Spatio-temporal interest point (STIP) based methods have shown promising results for classifying human actions in complex scenes efficiently. However, state-of-the-art works typically use the bag-of-visual-words (BoVW) model, which focuses only on the word distribution of STIPs and ignores the distinctive character of word structure. In this paper, the distribution of STIPs is organized into a salient directed graph, which reflects salient motions and can be divided into a time salient directed graph and a space salient directed graph, with the aim of adding spatio-temporal discriminative power to BoVW. Both salient directed graphs are constructed from pairs of labeled STIPs. In detail, the "directional co-occurrence" property of pairwise STIPs with different labels in the same frame is used to represent the time saliency, and the space saliency is reflected by the "geometric relationships" between pairwise STIPs with the same label across different frames. New statistical features, namely the Time Salient Pairwise feature (TSP) and the Space Salient Pairwise feature (SSP), are then designed to describe the two salient directed graphs, respectively. Experiments are carried out with a homogeneous kernel SVM classifier on four challenging datasets: KTH, ADL and UT-Interaction. The results confirm the complementarity of TSP and SSP, and show that our multi-cue representation TSP + SSP + BoVW can properly describe human actions with large intra-class variability in real time. Copyright (C) 2016, Chongqing University of Technology. Production and hosting by Elsevier B.V.
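
As a rough illustration of the "directional co-occurrence" idea described in the abstract, the following Python sketch counts ordered pairs of differently labeled STIPs that fall in the same frame. The abstract does not say how the direction of a pair is chosen, so ordering by the x coordinate is an assumption made here, as are the function name `directional_cooccurrence`, the `(x, y, t, label)` tuple layout, and the normalization. It is a sketch under those assumptions, not the authors' TSP implementation.

```python
from collections import Counter
from itertools import combinations


def directional_cooccurrence(stips, num_words):
    """Histogram of ordered label pairs (a -> b) among same-frame STIPs.

    stips: iterable of (x, y, t, label) tuples, label in [0, num_words).
    Returns a flat list of length num_words * num_words that sums to 1
    (or all zeros if no differently labeled pair shares a frame).
    """
    # Group points by frame index t.
    by_frame = {}
    for x, y, t, label in stips:
        by_frame.setdefault(t, []).append((x, y, label))

    counts = Counter()
    for points in by_frame.values():
        for p, q in combinations(points, 2):
            if p[2] == q[2]:
                continue  # only pairs with different labels are counted
            # ASSUMPTION: the edge points from the smaller x to the larger x
            # (ties broken by y); the source does not specify this rule.
            src, dst = sorted((p, q), key=lambda s: (s[0], s[1]))
            counts[(src[2], dst[2])] += 1

    hist = [0.0] * (num_words * num_words)
    total = sum(counts.values())
    for (a, b), c in counts.items():
        hist[a * num_words + b] = c / total
    return hist
```

Because the pairs are ordered, the entry for (a, b) can differ from the entry for (b, a), which is the extra structural information that a plain word histogram does not carry.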

Spatio-Temporal Deep Q-Networks for Human Activity Localization

LSTM with Uniqueness Attention for Human Activity Recognition

Spatiotemporal Multi-Task Network for Human Activity Understanding

Local Spatio-Temporal Feature Based Voting Framework for Complex Human Activity Detection and Localization

ZSTAD: Zero-Shot Temporal Activity Detection

A Hierarchical Spatio-Temporal Model for Human Activity Recognition

A weakly supervised CNN model for spatial localization of human activities in unconstraint environment

Temporal Context Network for Activity Localization in Videos

Spatio-Temporal Attention Networks for Action Recognition and Detection

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Real-time spatiotemporal action localization algorithm using improved CNNs architecture

Modeling Spatio-Temporal Human Track Structure for Action Localization

3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks

Salient Pairwise Spatio-Temporal Interest Points for Real-Time Activity Recognition

R-C3D: Region Convolutional 3D Network for Temporal Activity Detection

Learning to Track for Spatio-Temporal Action Localization

Tracking Objects and Activities with Attention for Temporal Sentence Grounding

Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization

Exploring Frame Segmentation Networks for Temporal Action Localization

Spatiotemporal Interaction Residual Networks with Pseudo3D for Video Action Recognition

Bi-STAN: bilinear spatial-temporal attention network for wearable human activity recognition