Temporal Contrastive and Spatial Enhancement Coarse Grained Network for Weakly Supervised Group Activity Recognition

Jie Guo, Yongxin Ge
DOI: https://doi.org/10.1016/j.engappai.2024.108115
IF: 8
2024-01-01
Engineering Applications of Artificial Intelligence
Abstract: Group activity recognition (GAR) is an increasingly popular topic in computer vision, and researchers have proposed a range of methods that achieve outstanding recognition performance. However, these methods invariably require fine-grained personal feature extraction and a large network architecture to aggregate individual features or reason about relationships between persons. To reduce the heavy annotation burden and high training costs, weak supervision has emerged as a promising approach. Under the weak supervision paradigm, only coarse-grained labels are used during network training. Nevertheless, this paradigm poses two key challenges. First, it is limited in its ability to model temporal relationships among individual persons; second, it tends to focus on less relevant information, leading to suboptimal network parameter optimization. Both challenges result in erroneous judgments about temporal information and in training inefficiencies. To address these challenges within the weak supervision paradigm, we propose a novel Temporal Contrastive and Spatial Enhancement Coarse-Grained Network (TCSE-CGN) to solve the GAR problem. TCSE-CGN comprises two simple yet effective streams, namely the Spatial Enhancement Stream and the Temporal Contrastive Stream. After features are extracted from only a few RGB frames, half of the extracted features are sent to the Spatial Enhancement Stream, where an attention mechanism enhances them so that the network automatically learns more representative information. The remaining features are sent to the Temporal Contrastive Stream, which uses contrastive learning to model temporal relationships among all RGB frame-level features. Specifically, the network is guided to learn the hidden semantic temporal information in inter-frame sequences. Network parameters are optimized using a combination of a universal cross-entropy loss and a novel temporal contrastive loss.
Comprehensive experiments are conducted on two widely used datasets, namely the Volleyball dataset and the Collective dataset, to demonstrate the effectiveness of TCSE-CGN. Results show that TCSE-CGN performs competitively with other works that require more supervision and a larger architecture.
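The abstract does not give the form of the feature split or of the temporal contrastive loss, so the following is only a minimal sketch under stated assumptions: frame-level features are plain vectors, the "half" split is taken along the channel dimension, and the temporal loss is approximated by a generic InfoNCE-style objective in which each frame's positive is the next frame and all other frames are negatives. The names `split_channels`, `temporal_info_nce`, and `temperature` are hypothetical and do not come from the paper.

```python
import math


def split_channels(feature):
    """Split one frame-level feature vector into two equal halves.

    Assumption: the 'half of the extracted feature' in the abstract is a
    channel-wise split; the paper may split differently.
    """
    half = len(feature) // 2
    return feature[:half], feature[half:]


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def temporal_info_nce(frames, temperature=0.1):
    """Generic InfoNCE-style loss over an ordered list of frame features.

    Stand-in for the paper's temporal contrastive loss (an assumption):
    for frame i, frame i+1 is the positive and every other frame is a
    negative. Returns the mean loss over frames 0..n-2.
    """
    n = len(frames)
    total = 0.0
    for i in range(n - 1):
        # Similarities from frame i to every other frame, scaled by temperature.
        logits = [cosine(frames[i], frames[j]) / temperature
                  for j in range(n) if j != i]
        # With index i skipped, frame i+1 sits at position i in `logits`.
        positive = logits[i]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive
    return total / (n - 1)
```

In this sketch, a sequence whose neighbouring frames have similar features receives a lower loss than the same frames in a scrambled order, which is the kind of inter-frame ordering signal the abstract attributes to the Temporal Contrastive Stream. In training, such a term would be added to the classification cross-entropy loss; the weighting between the two is not stated in the abstract.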