Video Crowd Localization with Multi-focus Gaussian Neighborhood Attention and a Large-Scale Benchmark

Haopeng Li, Lingbo Liu, Kunlin Yang, Shinan Liu, Junyu Gao, Bin Zhao, Rui Zhang, Jun Hou
2021-01-01
Abstract: Video crowd localization is a crucial yet challenging task that aims to estimate the exact locations of human heads in crowded videos. To model the spatial-temporal dependencies of human mobility, we propose a multi-focus Gaussian neighborhood attention (GNA), which effectively exploits long-range correspondences while maintaining the spatial topological structure of the input videos. In particular, our GNA also captures the scale variation of human heads well through its multi-focus mechanism. Based on the multi-focus GNA, we develop a unified neural network called GNANet, which accurately locates head centers in video clips by fully aggregating spatial-temporal information via a scene modeling module and a context cross-attention module. Moreover, to facilitate future research in this field, we introduce a large-scale crowd video benchmark named SenseCrowd, which consists of 60K+ frames captured in various surveillance scenarios and 2M+ head annotations. Finally, we conduct extensive experiments on three datasets, including our SenseCrowd, and the experimental results show that the proposed method achieves state-of-the-art performance for both video crowd localization and counting. The code and the dataset will be released.