Group Activity Representation Learning with Long-Short States Predictive Transformer.

Longteng Kong, Wanting Zhou, Duoxuan Pei, Zhaofeng He, Di Huang
DOI: https://doi.org/10.1109/tcsvt.2023.3278984
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: This paper aims to learn group activity representations in a self-supervised fashion rather than with conventional methods that rely on manually annotated labels. This task requires a better description of complex group states and their future transitions. To this end, we propose a long-short state predictive Transformer (LSSPT), which mines meaningful spatiotemporal features of group activities by predicting future group states from long- and short-term historical state dynamics. LSSPT consists of an encoder that models diverse spatiotemporal state representations in the observation and a decoder that exploits rich dynamic patterns by attending to both the short-term spatial context and the long-term evolution of historical states to predict future group states. Furthermore, we consider the distinguishability and consistency of the predicted states and introduce a joint learning mechanism to optimize the models, enabling LSSPT to describe more reliable state transitions. Finally, extensive experiments evaluate the learned representation on downstream tasks on the Volleyball, Collective Activity, and VolleyTactic datasets, demonstrating state-of-the-art performance over existing self-supervised learning approaches.