Abstract: Action recognition is essential for many human-centered applications in the Internet of Things (IoT). In the Internet of Medical Things (IoMT) in particular, it is important for surgical assistance, patient monitoring, and similar tasks. Recently, 3-D skeleton sequence-based action recognition has drawn broad attention. It is a challenging task that requires effective modeling of intraframe skeleton representations and interframe temporal dynamics. Standard long short-term memory (LSTM)-based models are widely used for sequence modeling because of their long-term memory, yet they cannot fully model the relationships between different body joints or persons, and so fail to extract crucial co-occurrence features at different levels. To address this shortcoming, we propose an attention-based multilevel co-occurrence graph convolutional LSTM (AMCGC-LSTM). By integrating graph convolutional networks (GCNs) into LSTM, the proposed model leverages body structural information from skeletons and strengthens multilevel co-occurrence (MC) feature learning. Specifically, we first design a spatial attention module that enhances the features of key joints in the skeleton inputs. Second, we design MC memory units coupled with GCN to automatically model the spatial relationships between joints while simultaneously capturing co-occurrence features across joints, persons, and frames. Finally, we construct aggregated features of MCs (AFMCs) from the MC memory units to better represent the intraframe action context encoding, and we use a concurrent LSTM (Co-LSTM) to further model their temporal dynamics for action recognition. Our model significantly outperforms mainstream methods on the NTU RGB+D 60/120 data sets, the mutual action subsets of the NTU RGB+D 60/120 data sets, and the Northwestern-UCLA data set.
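
To make the idea of integrating a GCN into an LSTM concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a softmax spatial attention over joints and an LSTM step whose input and hidden transforms are preceded by a normalized-adjacency graph convolution. All function names, shapes, and the chain-shaped toy skeleton are hypothetical, and the MC memory units, AFMCs, and Co-LSTM described in the abstract are not reproduced here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as in standard GCNs."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def spatial_attention(x, w_att):
    """Assumed attention form: softmax over joints. x: (J, C), w_att: (C,)."""
    scores = x @ w_att
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return x * weights[:, None], weights


def gc_lstm_step(x, h, c, a_norm, w_x, w_h, b):
    """One LSTM step whose gates are computed with graph convolutions.

    x: (J, C) joint features, h and c: (J, H) per-joint states,
    w_x: (C, 4H), w_h: (H, 4H), b: (4H,).
    """
    gates = a_norm @ x @ w_x + a_norm @ h @ w_h + b
    i, f, o, g = np.split(gates, 4, axis=1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def run_demo(num_joints=5, channels=3, hidden=4, frames=6, seed=0):
    """Run the cell over random frames of a toy chain-shaped skeleton."""
    rng = np.random.default_rng(seed)
    adj = np.zeros((num_joints, num_joints))
    for j in range(num_joints - 1):  # joint j is connected to joint j + 1
        adj[j, j + 1] = adj[j + 1, j] = 1.0
    a_norm = normalize_adjacency(adj)

    w_att = rng.normal(size=channels)
    w_x = rng.normal(scale=0.1, size=(channels, 4 * hidden))
    w_h = rng.normal(scale=0.1, size=(hidden, 4 * hidden))
    b = np.zeros(4 * hidden)

    h = np.zeros((num_joints, hidden))
    c = np.zeros((num_joints, hidden))
    weights = None
    for _ in range(frames):
        x = rng.normal(size=(num_joints, channels))  # stand-in for 3-D joints
        x_att, weights = spatial_attention(x, w_att)
        h, c = gc_lstm_step(x_att, h, c, a_norm, w_x, w_h, b)
    return h, weights
```

Calling `run_demo()` returns the final per-joint hidden state and the attention weights of the last frame. In a full model, such hidden states would be aggregated and passed to a further temporal model and a classifier.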

Efficient spatiotemporal context modeling for action recognition

Efficient Spatialtemporal Context Modeling for Action Recognition

Learning Visual Context for Group Activity Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

DC3D: A Video Action Recognition Network Based on Dense Connection

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

An efficient attention module for 3d convolutional neural networks in action recognition

Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition

Global Context-Aware Attention LSTM Networks for 3D Action Recognition

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Spatio-Temporal Attention Networks for Action Recognition and Detection

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Attention-Based Multilevel Co-Occurrence Graph Convolutional LSTM for 3-D Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Grouped Spatial-Temporal Aggregation for Efficient Action Recognition

Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection

Spatio-Temporal Collaborative Module for Efficient Action Recognition

CTM: Cross-time temporal module for fine-grained action recognition

Unified Spatio-Temporal Attention Models for Advanced Human Action Recognition & Detection

ACTION-Net: Multipath Excitation for Action Recognition