Abstract: Group Activity Recognition aims to understand collective activities from videos. Existing solutions primarily rely on the RGB modality, which encounters challenges such as background variations, occlusions, motion blur, and significant computational overhead. Meanwhile, current keypoint-based methods offer a lightweight and informative representation of human motions but necessitate accurate individual annotations and specialized interaction reasoning modules. To address these limitations, we design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity, offering an effective alternative to RGB video. This panoramic graph enables a Graph Convolutional Network (GCN) to unify intra-person, inter-person, and person-object interactive modeling through spatial-temporal graph convolutions. In practice, we develop a novel pipeline that extracts skeleton coordinates using pose estimation and tracking algorithms and employs a Multi-person Panoramic GCN (MP-GCN) to predict group activities. Extensive experiments on Volleyball and NBA datasets demonstrate that the MP-GCN achieves state-of-the-art performance in both accuracy and efficiency. Notably, our method outperforms RGB-based approaches by using only estimated 2D keypoints as input. Code is available at https://github.com/mgiant/MP-GCN

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address several key challenges in Group Activity Recognition (GAR). Specifically, existing methods primarily rely on RGB video modalities, which bring issues such as background changes, occlusion, motion blur, and significant computational overhead. Although keypoint-based methods can provide lightweight and informative representations, they require precise individual annotations and specialized interaction reasoning modules.

To solve these problems, the authors designed a panoramic graph that integrates multiple human skeletons and objects to encapsulate group activities. This panoramic graph unifies the modeling of interactions within individuals, between individuals, and between individuals and objects through spatial-temporal graph convolutions. Using this approach, the authors proposed a new pipeline that extracts skeleton coordinates using pose estimation and tracking algorithms and predicts group activities using a Multi-person Panoramic GCN (MP-GCN).

### Main Contributions

1. **New Pipeline**: Developed a new skeleton-based group activity recognition pipeline that does not require ground truth individual boxes and labels. This pipeline includes obtaining keypoints through pose estimation and tracking algorithms and using a skeleton-based graph convolutional network for activity recognition.
2. **Panoramic Graph**: Designed a panoramic multi-person-object graph to represent group activities, addressing the shortcomings of previous methods and unifying the modeling of interactions within and between individuals.
3. **Performance Improvement**: Using only estimated human poses and object keypoints, this method significantly outperforms existing methods on three widely used datasets, with much lower computational costs compared to RGB-based methods.
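
To make the panoramic graph idea concrete, the following is a minimal NumPy sketch of how person joints and object nodes could share one adjacency matrix, followed by a single spatial graph convolution. It is an illustration under stated assumptions, not the authors' implementation: the 5-joint toy skeleton, the choice of `ROOT_JOINT` for inter-person links, the choice of `HAND_JOINTS` for person-object links, and all function names are hypothetical. The symmetric normalization is the standard GCN form, which the paper may or may not use.

```python
import numpy as np

# Hypothetical toy skeleton with 5 joints per person (not the paper's layout):
# 0 = head, 1 = torso, 2 = left hand, 3 = right hand, 4 = hip
INTRA_EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]
HAND_JOINTS = [2, 3]  # assumption: objects connect to hands
ROOT_JOINT = 1        # assumption: people connect to each other via the torso


def build_panoramic_adjacency(num_people, num_objects, joints_per_person=5):
    """Return a symmetric 0/1 adjacency over all person joints plus object nodes."""
    n = num_people * joints_per_person + num_objects
    adj = np.zeros((n, n), dtype=np.float32)

    def node(person, joint):
        return person * joints_per_person + joint

    def link(a, b):
        adj[a, b] = adj[b, a] = 1.0

    for p in range(num_people):
        # Intra-person edges: the bones of one skeleton.
        for a, b in INTRA_EDGES:
            link(node(p, a), node(p, b))
        # Inter-person edges: connect root joints of every pair of people.
        for q in range(p + 1, num_people):
            link(node(p, ROOT_JOINT), node(q, ROOT_JOINT))
        # Person-object edges: connect each hand to each object node.
        for k in range(num_objects):
            obj = num_people * joints_per_person + k
            for h in HAND_JOINTS:
                link(node(p, h), obj)
    return adj


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0], dtype=adj.dtype)
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def spatial_graph_conv(x, adj_norm, weight):
    """x: (T, N, C_in), adj_norm: (N, N), weight: (C_in, C_out) -> (T, N, C_out)."""
    return np.einsum("nm,tmc,cd->tnd", adj_norm, x, weight)


# Usage: 3 people and 1 object (e.g. a ball), 4 frames of 2D coordinates.
adj = build_panoramic_adjacency(num_people=3, num_objects=1)
adj_norm = normalize_adjacency(adj)
rng = np.random.default_rng(0)
x = rng.standard_normal((4, adj.shape[0], 2)).astype(np.float32)
w = rng.standard_normal((2, 8)).astype(np.float32)
out = spatial_graph_conv(x, adj_norm, w)
```

Because all three edge types live in one matrix, a single graph convolution mixes information within a body, across bodies, and between bodies and objects, which is the unification the summary describes. A full spatial-temporal model would additionally convolve along the frame axis.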
### Experimental Results

- **Volleyball Dataset**:
  - Under fully supervised settings, using ground truth boxes, MP-GCN achieved 95.5% multi-class classification accuracy (MCA) and 84.6% individual action classification mean accuracy (IMCA).
  - Combined with the VGG-16 backbone network, the final MCA reached 96.2%.
  - Under weakly supervised settings, using YOLOv8 estimated trajectories, MP-GCN still achieved 92.8% MCA and 96.1% merged MCA (MMCA).
- **NBA Dataset**:
  - MP-GCN achieved 76.0% MCA and 71.9% MPCA at T=72.
  - Using a late fusion strategy, MCA further improved to 78.7%, and MPCA improved to 74.6%.

### Summary

By introducing panoramic graphs and multi-person graph convolutional networks, this paper effectively addresses several challenges in RGB video-based group activity recognition, providing higher accuracy and lower computational costs.
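
The metrics and the late fusion step mentioned in the results can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code. MCA is taken to be overall classification accuracy, and MPCA is assumed to mean the average of per-class accuracies, the usual reading in this literature (the text above does not expand the acronym). The `late_fusion` helper and its equal weighting are hypothetical, since the summary does not say which streams are fused or how they are weighted.

```python
import numpy as np


def mca(y_true, y_pred):
    """Overall accuracy: fraction of clips whose predicted class is correct."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())


def mpca(y_true, y_pred):
    """Mean per-class accuracy (assumed meaning of MPCA): average recall over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))


def late_fusion(scores_a, scores_b, weight_a=0.5):
    """Hypothetical late fusion: weighted average of two models' class scores."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    return weight_a * scores_a + (1.0 - weight_a) * scores_b


# Usage on toy labels: class 0 has three clips, class 1 has one.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
acc = mca(y_true, y_pred)             # 3 of 4 clips correct
mean_class_acc = mpca(y_true, y_pred) # average of 2/3 and 1

# Fuse two score matrices of shape (clips, classes), then take the top class.
fused = late_fusion([[0.6, 0.4], [0.3, 0.7]], [[0.2, 0.8], [0.4, 0.6]])
fused_pred = fused.argmax(axis=1)
```

The two metrics diverge on imbalanced data: MCA weights every clip equally, while the per-class average weights every class equally, which is why both are reported for a dataset such as NBA.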

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph
