PointerNet - Spatiotemporal Modeling for Crowd Counting in Videos.

Changsheng Liu, Yuan Huang, Yadong Mu, Xiaoming Yu
DOI: https://doi.org/10.1145/3480001.3480018
2021-01-01
Abstract: Existing deep learning methods for video crowd counting mainly focus on how to leverage temporal correlation to improve the model. Studies have shown that convolutional neural networks with spatiotemporal three-dimensional kernels (3D CNNs) are promising architectures for video crowd counting. However, existing 3D CNN-based methods cannot be made as deep as 2D CNNs, owing to their considerable number of parameters and the lack of labeled data; this leads to overfitting of the 3D CNNs and unsatisfactory video crowd counting performance. To address this issue, a novel end-to-end video crowd counting framework named PointerNet (PseudO-3D (P3D) CNNs INtegrated with Temporal channEl-awaRe (TCA) block) is proposed. The use of P3D kernels gives our framework greater structural diversity and lets it go deeper, while keeping computational cost and memory demand limited. In addition, the TCA block is integrated into our architecture to help exploit the temporal interdependencies among video sequences. Experiments on three benchmark datasets indicate that the proposed method delivers state-of-the-art performance.
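The parameter saving behind the P3D claim can be illustrated with simple arithmetic. The sketch below assumes the common pseudo-3D factorization, in which a 3x3x3 convolution is replaced by a 1x3x3 spatial convolution followed by a 3x1x1 temporal convolution. The abstract does not state the kernel sizes or channel widths used in PointerNet, so the function names and the 64-channel example are illustrative assumptions only.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a bias-free 3D convolution with a kt x kh x kw kernel."""
    return c_in * c_out * kt * kh * kw


def full_3d_params(c_in, c_out, k=3):
    """Weights in one full k x k x k spatiotemporal convolution."""
    return conv3d_params(c_in, c_out, k, k, k)


def pseudo_3d_params(c_in, c_out, k=3):
    """Weights in an assumed P3D-style pair: 1 x k x k spatial, then k x 1 x 1 temporal."""
    spatial = conv3d_params(c_in, c_out, 1, k, k)
    temporal = conv3d_params(c_out, c_out, k, 1, 1)
    return spatial + temporal


if __name__ == "__main__":
    channels = 64  # hypothetical layer width, not taken from the paper
    full = full_3d_params(channels, channels)
    p3d = pseudo_3d_params(channels, channels)
    print(f"full 3D:   {full} weights")
    print(f"pseudo-3D: {p3d} weights ({p3d / full:.0%} of full 3D)")
```

Under these assumptions the factorized pair needs fewer than half the weights of the full 3D kernel, which is consistent with the abstract's point that P3D kernels allow a deeper network at limited computational and memory cost.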