Recurrent Prediction with Spatio-Temporal Attention for Crowd Attribute Recognition

Qiaozhe Li, Xin Zhao, Ran He, Kaiqi Huang
DOI: https://doi.org/10.1109/tcsvt.2019.2923444
IF: 5.859
2019-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Crowd attribute recognition is a challenging task in crowd video understanding because a crowd video often contains multiple attributes of various types. Traditional deep learning-based methods treat this recognition problem as multiple binary classification problems and represent the video by vectorizing and fusing separately learned spatial and temporal features in the fully connected layers. As a result, the correlations between the attributes may not be well captured. In this paper, a bidirectional recurrent prediction model with a semantic-aware attention mechanism is proposed to explore the spatio-temporal and semantic relations between the attributes for more accurate recognition. A ConvLSTM is introduced for feature representation to capture the spatio-temporal structure of the crowd videos and to facilitate visual attention. A bidirectional recurrent attention module is proposed for sequential attribute prediction; it iteratively associates the attributes of each subcategory with their semantically related regions. Experiments and evaluations on the challenging WWW crowd video dataset show that our approach significantly outperforms state-of-the-art methods and verify that it effectively captures the spatio-temporal and semantic relations of the crowd attributes.
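For illustration, the baseline formulation that the abstract contrasts against (multiple binary classification) can be sketched roughly as follows. This is a minimal sketch, not the authors' code: the attribute names, logit values, and function names are hypothetical. Each attribute gets its own sigmoid score and its own loss term, so nothing in the objective ties one attribute to another, which is the limitation the abstract points to.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multi_label_bce(logits, labels):
    """Mean binary cross-entropy, one independent term per attribute."""
    assert len(logits) == len(labels)
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(logits)


def predict(logits, threshold=0.5):
    """Threshold each attribute score independently of the others."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Hypothetical attributes and scores, for illustration only.
attributes = ["outdoor", "stadium", "audience", "walk"]
logits = [2.0, -1.5, 0.3, -0.2]
labels = [1, 0, 1, 0]

predictions = predict(logits)
loss = multi_label_bce(logits, labels)
print(dict(zip(attributes, predictions)), round(loss, 4))
```

Because every term in the loss depends on a single attribute, any correlation between attributes has to be learned implicitly in the shared features. The recurrent, attention-based prediction described in the abstract instead predicts attributes sequentially.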