Abstract: Group activity recognition aims to infer the activity of a group in multi-person scenes. Previous methods usually model inter-person relations and integrate individuals' features into group representations. However, they neglect the intra-person relations contained in the human skeleton, even though individual representations can also be inferred by analyzing how the skeleton evolves over time. In this paper, we use RGB images and human skeletons as inputs, since they contain complementary information. Because the two inputs have different semantic attributes, we design a separate branch for each. For RGB images, we propose a Scene Encoded Transformer, a Spatial Transformer, and a Temporal Transformer to explore inter-person spatial and temporal relations. For skeleton inputs, we capture intra-person spatial and temporal dynamics with a Spatial GCN and a Temporal GCN. Our main contributions are: i) we propose a two-branch spatial-temporal network for group activity recognition that uses RGB images and human skeletons; experiments show that our model achieves 97.1% MCA and 96.1% MPCA on the Collective Activity dataset and 94.0% MCA and 94.4% MPCA on the Volleyball dataset; ii) we extend the two datasets with human skeleton annotations, namely human joint coordinates and confidence scores, which can also be used for the action recognition task. The code is available at https://github.com/zxll0106/Image_and_Skeleton_Based_Group_Activity_Recognition.
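The abstract reports results as MCA and MPCA without expanding the acronyms. The sketch below assumes the usage common in the group activity recognition literature: MCA as multi-class classification accuracy (the fraction of all samples classified correctly) and MPCA as mean per-class accuracy (the unweighted average of per-class accuracies). The function names and the toy labels are hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict


def mca(y_true, y_pred):
    """Assumed definition: overall accuracy across all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mpca(y_true, y_pred):
    """Assumed definition: accuracy computed per class, then averaged."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    totals = defaultdict(int)  # number of samples per ground-truth class
    hits = defaultdict(int)    # correctly classified samples per class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, made-up labels: an imbalanced set where the rare class is always missed.
y_true = ["walking"] * 4 + ["talking"]
y_pred = ["walking"] * 5

print(f"MCA  = {mca(y_true, y_pred):.2f}")
print(f"MPCA = {mpca(y_true, y_pred):.2f}")
```

Under these assumed definitions the two numbers diverge when classes are imbalanced, which is one reason both are typically reported side by side.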

Learning Action Correlation and Temporal Aggregation for Group Representation

Learning Visual Context for Group Activity Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Latent Embeddings for Collective Activity Recognition

Hierarchical Deep Temporal Models for Group Activity Recognition

Self-Supervised Global Spatio-Temporal Interaction Pre-Training for Group Activity Recognition

Part-Aware Spatial-Temporal Graph Convolutional Network for Group Activity Recognition

Temporal Enhance and Spatial Gated Network for Group Activity Recognition

Group Activity Recognition by Using Effective Multiple Modality Relation Representation with Temporal-Spatial Attention

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Spatiotemporal Information Complementary Modeling and Group Relationship Reasoning for Group Activity Recognition

GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer

Improved Actor Relation Graph based Group Activity Recognition

Learning Group Residual Representation for Group Activity Prediction

Spatial Temporal Network for Image and Skeleton Based Group Activity Recognition

Modeling Multi-Scale Sub-Group Context for Group Activity Recognition

Group Activity Recognition Based on Temporal Semantic Sub-Graph Network

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

Hierarchical Attention and Context Modeling for Group Activity Recognition

Multi-Perspective Representation to Part-Based Graph for Group Activity Recognition

Grouped Spatial-Temporal Aggregation for Efficient Action Recognition