Abstract: Group activity recognition (GAR) is an increasingly popular topic in computer vision. Numerous methods have been proposed that achieve outstanding recognition performance. However, these methods invariably require fine-grained feature extraction for each person and a large network architecture to aggregate individual features or reason about relationships between persons. To reduce the need for extensive annotations and high training costs, weak supervision has emerged as a promising approach. Under the weak supervision paradigm, only coarse-grained labels are used during network training. Nevertheless, this paradigm poses two key challenges. First, it is limited in its ability to model temporal relationships among individual persons. Second, it tends to focus on less relevant information, which leads to suboptimal optimization of the network parameters. Both challenges result in erroneous judgments about temporal information and in training inefficiencies. To address these challenges within the weak supervision paradigm, we propose a novel Temporal Contrastive and Spatial Enhancement Coarse-Grained Network (TCSE-CGN) for the GAR problem. TCSE-CGN comprises two simple yet effective streams, namely the Spatial Enhancement Stream and the Temporal Contrastive Stream. After features are extracted from only a few RGB frames, half of the extracted feature is sent to the Spatial Enhancement Stream, where it is enhanced by an attention mechanism so that the network automatically learns more representative information. The remaining half is sent to the Temporal Contrastive Stream, which uses contrastive learning to model temporal relationships among all RGB frame-level features. Specifically, the network is guided to learn the hidden semantic temporal information in inter-frame sequences. Network parameters are optimized using a combination of a standard cross-entropy loss and a novel temporal contrastive loss.
Comprehensive experiments are conducted on two widely used datasets, namely the Volleyball dataset and the Collective dataset, to demonstrate the effectiveness of TCSE-CGN. Results show that TCSE-CGN performs competitively with other works that require more supervision and a larger architecture.
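
The two-stream split and the combined objective described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not give the form of the temporal contrastive loss, so an InfoNCE-style loss is swapped in as a stand-in, with the next frame treated as the positive and all other frames as negatives. The function names, the `temperature` and `weight` parameters, and the choice of cosine similarity are all assumptions made for illustration, and the attention-based enhancement step is omitted.

```python
import math

def split_channels(frames):
    """Split each frame-level feature vector into two halves along the channel axis.
    The first half would go to the spatial stream, the second to the temporal stream."""
    half = len(frames[0]) // 2
    spatial = [f[:half] for f in frames]
    temporal = [f[half:] for f in frames]
    return spatial, temporal

def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def temporal_contrastive_loss(feats, temperature=0.1):
    """ASSUMED InfoNCE-style stand-in: for frame t the positive is frame t+1,
    and every other frame in the clip is a negative. Requires at least 2 frames."""
    T = len(feats)
    total = 0.0
    for t in range(T - 1):
        # Similarity of frame t to every other frame, keyed by frame index.
        logits = {k: cosine(feats[t], feats[k]) / temperature
                  for k in range(T) if k != t}
        total += log_sum_exp(list(logits.values())) - logits[t + 1]
    return total / (T - 1)

def cross_entropy(class_logits, label):
    """Standard cross-entropy for one sample from unnormalised class scores."""
    return log_sum_exp(class_logits) - class_logits[label]

def total_loss(class_logits, label, temporal_feats, weight=1.0):
    """Combined objective: classification loss plus weighted contrastive term.
    The weighting scheme is hypothetical; the abstract only says the two are combined."""
    return (cross_entropy(class_logits, label)
            + weight * temporal_contrastive_loss(temporal_feats))

# Toy usage: 3 frames with 4-channel features, 2 activity classes.
frames = [[1.0, 0.0, 1.0, 0.0], [0.9, 0.1, 0.8, 0.2], [0.0, 1.0, 0.1, 0.9]]
spatial_half, temporal_half = split_channels(frames)
loss = total_loss([2.0, 0.5], label=0, temporal_feats=temporal_half)
```

In this sketch only a clip-level activity label enters `cross_entropy`, which mirrors the coarse-grained supervision setting, while the contrastive term needs no labels at all.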

Temporal Contrastive and Spatial Enhancement Coarse Grained Network for Weakly Supervised Group Activity Recognition

Learning Visual Context for Group Activity Recognition.

Educational tool for hospital-based training in family medicine.

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Dynamical Attention Hypergraph Convolutional Network for Group Activity Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Social Adaptive Module for Weakly-supervised Group Activity Recognition

SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Augmented skeleton sequences with hypergraph network for self-supervised group activity recognition

TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning

Group Contextualization for Video Recognition

DECOMPL: Decompositional Learning with Attention Pooling for Group Activity Recognition from a Single Volleyball Image

Group Activity Recognition Based on Temporal Semantic Sub-Graph Network.

Attentive spatial-temporal contrastive learning for self-supervised video representation

Multi-dimensional convolution transformer for group activity recognition

Separately Guided Context-Aware Network for Weakly Supervised Temporal Action Detection

Class-related Graph Convolution for Weakly Supervised Semantic Segmentation

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

Guidance and Teaching Network for Video Salient Object Detection

Detector-Free Weakly Supervised Group Activity Recognition