Abstract: Over the past few years, automatic recognition of human interactions has drawn significant attention from researchers in Artificial Intelligence (AI). Feature extraction is one of the most critical tasks in developing efficient Human Interaction Recognition (HIR) systems, and recent research in computer vision suggests that robust features lead to higher recognition accuracy. This paper therefore proposes an improved HIR system that combines 2D and 3D features extracted using machine learning and deep learning techniques. These discriminative features yield accurate classification and help avoid misclassification of similar interactions. To reduce computational complexity, ten keyframes are extracted from each video. The frames are preprocessed using image normalization and noise removal. The Region Of Interest (ROI), which contains the two humans involved in the interaction, is extracted using motion detection, and the human silhouettes are segmented using the GrabCut algorithm. The extracted silhouettes are converted into 3D meshes, and their heat kernel signatures (HKS) are computed to locate key body points. A Convolutional Neural Network (CNN) extracts full-body features from the 2D full-body silhouettes, and topological and geometric features are extracted from the key body points. Finally, the combined feature vector is fed into a Long Short-Term Memory (LSTM) network, and each interaction is recognized using a Softmax classifier. The proposed system has been validated through extensive experiments on three challenging RGB+D datasets, achieving recognition accuracies of 91.63%, 90.54%, and 90.13% on the SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets, respectively. These results suggest that the system can be used effectively in applications such as security, surveillance, health monitoring, and assisted living.
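The first and last stages of the pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the ten keyframes are chosen, so evenly spaced sampling is assumed here; the function names (`select_keyframes`, `softmax`, `classify`), the example scores, and the example labels are all hypothetical. The scores stand in for the per-class outputs that an LSTM would produce.

```python
import math
from typing import List, Sequence, Tuple


def select_keyframes(num_frames: int, k: int = 10) -> List[int]:
    """Return up to k evenly spaced frame indices.

    Assumption: uniform temporal sampling. The abstract only states
    that ten keyframes are extracted per video, not how.
    """
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    if num_frames <= k:
        # Short video: keep every frame.
        return list(range(num_frames))
    if k == 1:
        return [0]
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over per-class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores: Sequence[float], labels: Sequence[str]) -> Tuple[str, float]:
    """Pick the label with the highest softmax probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


if __name__ == "__main__":
    # Hypothetical 100-frame video and hypothetical class scores.
    print(select_keyframes(100))
    print(classify([0.1, 2.0, -1.0], ["approaching", "hugging", "kicking"]))
```

Subtracting the maximum score before exponentiating leaves the softmax output unchanged while avoiding overflow for large scores.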

Exploiting temporal information to detect conversational groups in videos and predict the next speaker

Temporal encoded F-formation system for social interaction detection

Recognizing Conversational Interaction Based On 3D Human Pose

Hierarchical Deep Temporal Models for Group Activity Recognition

Spatial-temporal dual-actor CNN for human interaction prediction in video

Interpretable prediction of brain activity during conversations from multimodal behavioral signals

Modeling social interaction dynamics using temporal graph networks

Hierarchical Long Short-Term Concurrent Memory for Human Interaction Recognition

Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder

Long-term Residual Recurrent Network for Human Interaction Recognition in Videos

CERN: Confidence-Energy Recurrent Network for Group Activity Recognition

Automatic detection of interaction groups

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification

An LSTM-Based Approach for Understanding Human Interactions Using Hybrid Feature Descriptors Over Depth Sensors

Person Detection in Collaborative Group Learning Environments Using Multiple Representations

Real-Time Multimodal Turn-taking Prediction to Enhance Cooperative Dialogue during Human-Agent Interaction

Detecting events and key actors in multi-person videos

Egocentric Auditory Attention Localization in Conversations

Social behavior recognition in continuous video

Online Recognition of Group Actions in Intelligent Meeting Scenario

Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech