Abstract: Group Activity Recognition aims to understand collective activities from videos. Existing solutions primarily rely on the RGB modality, which encounters challenges such as background variations, occlusions, motion blurs, and significant computational overhead. Meanwhile, current keypoint-based methods offer a lightweight and informative representation of human motions but necessitate accurate individual annotations and specialized interaction reasoning modules. To address these limitations, we design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity, offering an effective alternative to RGB video. This panoramic graph enables a Graph Convolutional Network (GCN) to unify intra-person, inter-person, and person-object interactive modeling through spatial-temporal graph convolutions. In practice, we develop a novel pipeline that extracts skeleton coordinates using pose estimation and tracking algorithms and employs Multi-person Panoramic GCN (MP-GCN) to predict group activities. Extensive experiments on Volleyball and NBA datasets demonstrate that the MP-GCN achieves state-of-the-art performance in both accuracy and efficiency. Notably, our method outperforms RGB-based approaches by using only estimated 2D keypoints as input. Code is available at https://github.com/mgiant/MP-GCN
What problem does this paper attempt to address?
### Problems Addressed by the Paper
This paper aims to address several key challenges in Group Activity Recognition (GAR). Specifically, existing methods primarily rely on RGB video modalities, which bring issues such as background changes, occlusion, motion blur, and significant computational overhead. Although keypoint-based methods can provide lightweight and informative representations, they require precise individual annotations and specialized interaction reasoning modules.
To solve these problems, the authors designed a panoramic graph that integrates multiple human skeletons and objects to encapsulate group activities. This panoramic graph unifies the modeling of interactions within individuals, between individuals, and between individuals and objects through spatial-temporal graph convolutions. Using this approach, the authors proposed a new pipeline that extracts skeleton coordinates using pose estimation and tracking algorithms and predicts group activities using a Multi-person Panoramic GCN (MP-GCN).
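To make the idea of a single graph covering all three interaction types concrete, here is a minimal sketch of how such an adjacency matrix could be assembled. It is not the authors' implementation: the node layout, the choice of one linking joint for inter-person edges, the hand-to-object edges, and the names `build_panoramic_adjacency` and `normalize` are all assumptions made for illustration. The normalization is the symmetric form commonly used in GCNs, which may differ from what MP-GCN actually applies.

```python
from math import sqrt


def build_panoramic_adjacency(num_persons, num_joints, bones,
                              link_joint, hand_joints, num_objects=1):
    """Return a dense 0/1 adjacency matrix for a toy panoramic graph.

    Assumed node layout: joint j of person p sits at index
    p * num_joints + j; object nodes follow all person nodes.
    """
    n = num_persons * num_joints + num_objects
    adj = [[0] * n for _ in range(n)]

    def connect(a, b):
        adj[a][b] = adj[b][a] = 1

    for p in range(num_persons):
        base = p * num_joints
        # Intra-person edges: the skeleton's own bones.
        for a, b in bones:
            connect(base + a, base + b)
        # Person-object edges (assumption): each hand joint to every object.
        for k in range(num_objects):
            obj = num_persons * num_joints + k
            for h in hand_joints:
                connect(base + h, obj)

    # Inter-person edges (assumption): link one reference joint pairwise.
    for p in range(num_persons):
        for q in range(p + 1, num_persons):
            connect(p * num_joints + link_joint, q * num_joints + link_joint)
    return adj


def normalize(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


# Toy example: 2 persons, 3 joints each (0=hip, 1=shoulder, 2=hand), 1 ball.
adj = build_panoramic_adjacency(
    num_persons=2, num_joints=3, bones=[(0, 1), (1, 2)],
    link_joint=0, hand_joints=[2], num_objects=1)
norm_adj = normalize(adj)
```

Because every node lives in one matrix, a single spatial graph convolution over `norm_adj` can propagate features along bones, across people, and between people and the object in the same step, which is the unification described above.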
### Main Contributions
1. **New Pipeline**: Developed a new skeleton-based group activity recognition pipeline that does not require ground-truth individual boxes and labels. The pipeline obtains keypoints through pose estimation and tracking algorithms, then applies a skeleton-based graph convolutional network for activity recognition.
2. **Panoramic Graph**: Designed a panoramic multi-person-object graph to represent group activities, addressing the shortcomings of previous methods and unifying the modeling of interactions within and between individuals.
3. **Performance Improvement**: Using only estimated human poses and object keypoints, this method significantly outperforms existing methods on three widely used datasets, with much lower computational costs compared to RGB-based methods.
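The pipeline in the first contribution has to turn variable-length tracker output into a fixed-size input for the network. The sketch below shows one plausible way to do that, under assumptions that do not come from the paper: the per-frame input format (a dict from track ID to keypoints), the rule of keeping the longest-lived tracks, zero-filling for missing people, and the function name `assemble_sequence` are all hypothetical.

```python
from collections import Counter


def assemble_sequence(frames, max_persons, num_joints):
    """Pack tracked keypoints into a fixed [T][M][V] layout of (x, y) pairs.

    frames: list over time; each item maps track_id -> list of (x, y).
    Assumption: tracks seen in the most frames are kept, and any person
    slot without a detection in a frame is filled with zeros.
    """
    # Count in how many frames each track ID appears.
    counts = Counter(tid for frame in frames for tid in frame)
    kept = [tid for tid, _ in counts.most_common(max_persons)]
    slot = {tid: i for i, tid in enumerate(kept)}

    # Start from an all-zero sequence, then copy in the detections.
    seq = [[[(0.0, 0.0)] * num_joints for _ in range(max_persons)]
           for _ in frames]
    for t, frame in enumerate(frames):
        for tid, keypoints in frame.items():
            if tid in slot:
                seq[t][slot[tid]] = list(keypoints)
    return seq


# Toy input: track 7 is present in all frames, tracks 9 and 3 in one each.
frames = [
    {7: [(1.0, 2.0), (3.0, 4.0)], 9: [(5.0, 6.0), (7.0, 8.0)]},
    {7: [(1.0, 1.0), (2.0, 2.0)]},
    {7: [(0.5, 0.5), (1.5, 1.5)], 3: [(9.0, 9.0), (8.0, 8.0)]},
]
seq = assemble_sequence(frames, max_persons=2, num_joints=2)
```

Here `seq[t][m][v]` is the (x, y) position of joint `v` of person slot `m` at frame `t`. Object keypoints, such as a ball position, could be appended as extra nodes in the same way.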
### Experimental Results
- **Volleyball Dataset**:
  - Under fully supervised settings, using ground-truth boxes, MP-GCN achieved 95.5% multi-class classification accuracy (MCA) and 84.6% individual action classification mean accuracy (IMCA).
  - Combined with a VGG-16 backbone network, the final MCA reached 96.2%.
  - Under weakly supervised settings, using YOLOv8-estimated trajectories, MP-GCN still achieved 92.8% MCA and 96.1% merged MCA (MMCA).
- **NBA Dataset**:
  - MP-GCN achieved 76.0% MCA and 71.9% mean per-class accuracy (MPCA) at T=72.
  - Using a late fusion strategy, MCA further improved to 78.7%, and MPCA improved to 74.6%.
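The two headline metrics and the late fusion step can be sketched in a few lines. MCA is plain overall accuracy, while MPCA averages the per-class accuracies so that rare classes count as much as common ones. The fusion function is an assumption for illustration: it takes a weighted sum of class scores from several input streams and picks the top class, and the summary does not say which streams or weights the authors actually fuse.

```python
def mca(y_true, y_pred):
    """Multi-class classification accuracy: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mpca(y_true, y_pred):
    """Mean per-class accuracy: average of each class's own accuracy."""
    per_class = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


def late_fusion(stream_scores, weights=None):
    """Weighted sum of per-stream class scores, then argmax per sample.

    stream_scores: one entry per stream; each is a list of per-sample
    score lists. Streams and weights are hypothetical in this sketch.
    """
    weights = weights or [1.0] * len(stream_scores)
    preds = []
    for sample_scores in zip(*stream_scores):
        num_classes = len(sample_scores[0])
        fused = [sum(w * s[c] for w, s in zip(weights, sample_scores))
                 for c in range(num_classes)]
        preds.append(max(range(num_classes), key=fused.__getitem__))
    return preds


# Imbalanced toy labels: predicting class 0 everywhere looks good on MCA
# but is penalized by MPCA.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
overall = mca(y_true, y_pred)
balanced = mpca(y_true, y_pred)

# Two hypothetical streams scoring two samples over two classes.
stream_a = [[0.6, 0.4], [0.3, 0.7]]
stream_b = [[0.1, 0.9], [0.4, 0.6]]
fused_preds = late_fusion([stream_a, stream_b])
```

The gap between the two metrics on the toy labels shows why results on a class-imbalanced dataset are often reported with both.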
### Summary
By introducing a panoramic graph and a multi-person graph convolutional network, this paper addresses several challenges of RGB video-based group activity recognition, achieving higher accuracy at lower computational cost than RGB-based approaches.