Abstract: Over the past few years, automatic recognition of human interactions has drawn significant attention from researchers in Artificial Intelligence (AI). Feature extraction is one of the most critical tasks in developing efficient Human Interaction Recognition (HIR) systems, and recent research in computer vision suggests that robust features lead to higher recognition accuracy. Hence, this paper proposes an improved HIR system that combines 2D and 3D features extracted using machine learning and deep learning techniques. These discriminative features result in accurate classification and help avoid misclassification of similar interactions. Ten keyframes are extracted from each video to reduce computational complexity. These frames are then preprocessed using image normalization and noise removal techniques. The Region Of Interest (ROI), which contains the two humans involved in the interaction, is extracted using motion detection, and the human silhouettes are segmented using the GrabCut algorithm. Next, the extracted silhouettes are converted into 3D meshes, and their heat kernel signatures (HKS) are computed to extract key body points. A Convolutional Neural Network (CNN) is used to extract full-body features from the 2D full-body silhouettes, and topological and geometric features are extracted from the key body points. Finally, the combined feature vector is fed into a Long Short-Term Memory (LSTM) network, and each interaction is recognized using a Softmax classifier. The proposed system has been validated via extensive experimentation on three challenging RGB+D datasets. Recognition accuracies of 91.63%, 90.54%, and 90.13% have been achieved on the SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets, respectively. These results suggest that the system can be used effectively in applications such as security, surveillance, health monitoring, and assisted living.
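
To make the heat kernel signature step concrete, the following is a minimal sketch that evaluates the standard definition HKS(x, t) = sum_i exp(-lambda_i t) phi_i(x)^2 with NumPy. The abstract does not specify how the meshes are built, which Laplacian discretization is used, or which time scales are chosen, so the toy path graph, the combinatorial graph Laplacian, and the names `heat_kernel_signature` and `time_scales` are assumptions for illustration only and should not be read as the authors' implementation.

```python
import numpy as np


def heat_kernel_signature(laplacian, time_scales):
    """Return an (n_vertices, n_times) array of HKS values.

    HKS(x, t) = sum_i exp(-lambda_i * t) * phi_i(x)**2, where
    (lambda_i, phi_i) are the eigenpairs of a symmetric Laplacian.
    """
    # eigh handles symmetric matrices and returns orthonormal eigenvectors
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    time_scales = np.asarray(time_scales, dtype=float)
    # decay[i, k] = exp(-lambda_i * t_k)
    decay = np.exp(-np.outer(eigvals, time_scales))
    # Row x of eigvecs**2 holds phi_i(x)**2 for every eigenpair i
    return (eigvecs ** 2) @ decay


# Hypothetical stand-in for a body mesh: a path graph with 5 vertices.
n_vertices = 5
adjacency = np.zeros((n_vertices, n_vertices))
for i in range(n_vertices - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0

# Combinatorial graph Laplacian L = D - A (an assumed discretization).
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

hks = heat_kernel_signature(laplacian, time_scales=[0.1, 1.0, 10.0])
```

On this toy graph the two end vertices retain more heat at the smallest time scale than the interior vertices, which hints at why extrema of the signature can serve as candidates for extremities. How the proposed system actually selects key body points from the HKS is not described in the abstract.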
