Abstract: This article aims to tackle the problem of group activity recognition in the multiple-person scene. To model the group activity with multiple persons, most long short-term memory (LSTM)-based methods first learn the person-level action representations by several LSTMs and then integrate all the person-level action representations into the following LSTM to learn the group-level activity representation. This type of solution is a two-stage strategy, which neglects the "host-parasite" relationship between the group-level activity ("host") and person-level actions ("parasite") in spatiotemporal space. To this end, we propose a novel graph LSTM-in-LSTM (GLIL) for group activity recognition by modeling the person-level actions and the group-level activity simultaneously. GLIL is a "host-parasite" architecture, which can be seen as several person LSTMs (P-LSTMs) in the local view or a graph LSTM (G-LSTM) in the global view. Specifically, P-LSTMs model the person-level actions based on the interactions among persons. Meanwhile, G-LSTM models the group-level activity, where the person-level motion information in multiple P-LSTMs is selectively integrated and stored into G-LSTM based on their contributions to the inference of the group activity class. Furthermore, to use the person-level temporal features instead of the person-level static features as the input of GLIL, we introduce a residual LSTM with the residual connection to learn the person-level residual features, consisting of temporal features and static features. Experimental results on two public data sets illustrate the effectiveness of the proposed GLIL compared with state-of-the-art methods.
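The selective integration of person-level information into the group-level memory can be illustrated with a small sketch. The abstract does not give the GLIL equations, so everything below is an assumption made for illustration: the function name `integrate_persons`, the parameters `gate_weights` and `gate_bias`, and the choice of a per-person sigmoid gate followed by a weighted sum are all hypothetical and should not be read as the authors' formulation.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def integrate_persons(group_memory, person_states, gate_weights, gate_bias=0.0):
    """Hypothetical gated aggregation (an assumption, not the paper's equations).

    Each person-level state gets a scalar gate computed from that state.
    The gated states are then added to a copy of the group-level memory,
    so persons with a low gate contribute little to the group representation.
    """
    updated = list(group_memory)  # copy so the caller's memory is not mutated
    gates = []
    for state in person_states:
        # Linear score of this person's state, squashed to (0, 1).
        score = sum(w * s for w, s in zip(gate_weights, state)) + gate_bias
        gate = sigmoid(score)
        gates.append(gate)
        # Accumulate the gated state into the group memory, dimension by dimension.
        for d, value in enumerate(state):
            updated[d] += gate * value
    return updated, gates


# Two persons with 2-dimensional states; the gate favors the first dimension.
memory, gates = integrate_persons(
    group_memory=[0.0, 0.0],
    person_states=[[1.0, 0.0], [0.0, 1.0]],
    gate_weights=[10.0, -10.0],
)
print(gates)   # first gate close to 1, second close to 0
print(memory)  # group memory dominated by the first person's state
```

In this toy setting the first person's state passes almost unchanged into the group memory while the second is nearly suppressed, which is the qualitative behavior the abstract describes as integration "based on their contributions". A real model would learn the gate parameters jointly with the P-LSTMs and G-LSTM rather than fixing them by hand.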

Host-Parasite: Graph LSTM-in-LSTM for Group Activity Recognition
