Abstract: Over the past few years, automatic recognition of human interactions has drawn significant attention from researchers in Artificial Intelligence (AI). Feature extraction is one of the most critical tasks in developing efficient Human Interaction Recognition (HIR) systems, and recent research in computer vision suggests that robust features lead to higher recognition accuracy. This paper therefore proposes an improved HIR system that combines 2D and 3D features extracted using machine learning and deep learning techniques. These discriminative features support accurate classification and help avoid misclassification of similar interactions. Ten keyframes are extracted from each video to reduce computational complexity, and these frames are preprocessed using image normalization and noise removal. The Region Of Interest (ROI), which contains the two humans involved in the interaction, is extracted using motion detection, and the human silhouettes are then segmented using the GrabCut algorithm. Next, the extracted silhouettes are converted into 3D meshes, and their heat kernel signatures (HKS) are computed to locate key body points. A Convolutional Neural Network (CNN) extracts full-body features from the 2D full-body silhouettes, while topological and geometric features are extracted from the key body points. Finally, the combined feature vector is fed into a Long Short-Term Memory (LSTM) network, and each interaction is recognized using a Softmax classifier. The proposed system has been validated through extensive experiments on three challenging RGB+D datasets, achieving recognition accuracies of 91.63%, 90.54%, and 90.13% on the SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets, respectively. These results suggest that the system can be used effectively in applications such as security, surveillance, health monitoring, and assisted living.
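
Two steps of the pipeline described above, reducing each video to ten keyframes and turning final class scores into probabilities with a Softmax, can be illustrated with a minimal Python sketch. The abstract does not say how the keyframes are chosen, so the evenly spaced sampling below is an assumption, and the function names `select_keyframe_indices` and `softmax`, as well as the example scores, are hypothetical. The sketch leaves out the ROI extraction, GrabCut segmentation, HKS, CNN, and LSTM stages.

```python
import math


def select_keyframe_indices(num_frames, num_keyframes=10):
    """Return indices of evenly spaced keyframes.

    Assumption: uniform sampling; the abstract only states that ten
    keyframes are extracted per video, not how they are selected.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if num_frames <= num_keyframes:
        # Short clip: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_keyframes
    # Take the frame at the centre of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_keyframes)]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # A 100-frame video is reduced to ten keyframe indices.
    print(select_keyframe_indices(100))
    # Hypothetical scores for three interaction classes.
    probs = softmax([1.0, 2.0, 3.0])
    print(probs.index(max(probs)))
```

In a full system, the scores passed to `softmax` would come from the LSTM output layer, and the predicted interaction would be the class with the highest probability.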
