Abstract: Video-based crowd counting can exploit the spatial-temporal information shared by neighboring frames, which improves counting robustness and makes it more practical than single-image crowd counting in real applications. The task remains challenging, however, because severe occlusion and the translation, rotation, and scaling of people change the head density map from one frame to the next. To alleviate these issues, we propose a Multi-Level Feature Fusion Based Locality-Constrained Spatial Transformer Network (MLSTN), which consists of two components: a density map regression module and a Locality-Constrained Spatial Transformer (LST) module. Specifically, we first estimate the density map of each frame by combining the low-level, middle-level, and high-level features of a convolutional neural network, because low-level features may be more effective at capturing small heads, while middle-level and high-level features are more effective at capturing medium and large heads. Then, to model the relationship between the density maps of neighboring frames, the LST module estimates the density map of the next frame by concatenating several regressed density maps. To facilitate performance evaluation for video crowd counting, we have collected and labeled a large-scale video crowd counting dataset comprising 100 five-second sequences with 394,081 annotated heads from 13 different scenes. As far as we know, it is the largest video crowd counting dataset. Extensive experiments show the effectiveness of our approach on our dataset and on other video-based crowd counting datasets. The dataset is released online at https://github.com/sweetyy83/Lstn_fdst_dataset.
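To make the idea of relating density maps across neighboring frames concrete, the following is a minimal pure-Python sketch, not the authors' LST module. It assumes a toy density map built from one normalized Gaussian per head (so the map sums to the head count) and warps it with a 2x3 affine matrix using bilinear sampling, in the general manner of a spatial transformer. The function names (`make_density_map`, `warp_affine`, `count`), the grid size, and the head positions are all hypothetical, and the locality constraint described in the abstract is not modeled.

```python
import math


def make_density_map(height, width, heads, sigma=1.0):
    """Toy density map: one normalized Gaussian per head, so the sum equals len(heads)."""
    density = [[0.0] * width for _ in range(height)]
    for hy, hx in heads:
        blob = [
            [math.exp(-((y - hy) ** 2 + (x - hx) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, blob))
        for y in range(height):
            for x in range(width):
                density[y][x] += blob[y][x] / total
    return density


def warp_affine(density, theta):
    """Warp a map with inverse mapping: each output pixel (x, y) samples the input at
    (a*x + b*y + tx, c*x + d*y + ty) using bilinear interpolation.
    Samples that fall outside the input contribute zero."""
    height, width = len(density), len(density[0])
    (a, b, tx), (c, d, ty) = theta
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            sx = a * x + b * y + tx
            sy = c * x + d * y + ty
            x0, y0 = math.floor(sx), math.floor(sy)
            fx, fy = sx - x0, sy - y0
            value = 0.0
            for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
                for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
                    if 0 <= yy < height and 0 <= xx < width:
                        value += wy * wx * density[yy][xx]
            out[y][x] = value
    return out


def count(density):
    """The estimated crowd count is the sum of the density map."""
    return sum(map(sum, density))


if __name__ == "__main__":
    # Hypothetical frame with three heads given as (row, column) positions.
    frame_t = make_density_map(16, 16, [(4, 4), (8, 10), (11, 6)])
    # Shift content by 0.5 pixels down and 1.5 pixels right (inverse mapping, hence the minus signs).
    theta = [[1.0, 0.0, -1.5], [0.0, 1.0, -0.5]]
    frame_next = warp_affine(frame_t, theta)
    print(round(count(frame_t), 3), round(count(frame_next), 3))
```

In this sketch a pure translation keeps the count nearly unchanged as long as the heads stay away from the border. A transform that scales the map would change the sum unless the result is renormalized, which is one reason a learned module is needed rather than a fixed warp.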

Spatiotemporal Dilated Convolution with Uncertain Matching for Video-based Crowd Estimation

Leverage Multi-Scale Dilated Convolutional Neural Network with Global Attention Feature Fusion for Crowd Counting

DRENet: Giving Full Scope to Detection and Regression-Based Estimation for Video Crowd Counting

Distance-Aware Network for Physical-World Object Distribution Estimation and Counting

Crowd Counting by Multi-Scale Dilated Convolution Networks

ST-CNN: Spatial-Temporal Convolutional Neural Network for Crowd Counting in Videos.

Motion-guided Non-local Spatial-Temporal Network for Video Crowd Counting

Multi-scale Dilated Convolution of Feature Fusion Network for Crowd Counting

Enhanced 3D convolutional networks for crowd counting

An Improved Normed-Deformable Convolution for Crowd Counting

Multi-Dilation Network for Crowd Counting.

Crowd Counting Using Deep Recurrent Spatial-Aware Network.

Crowd Counting Based on Multiresolution Density Map and Parallel Dilated Convolution

A High-robustness and Low Resource-consumption Crowd Counting Model

Multi-level Feature Fusion Based Locality-Constrained Spatial Transformer Network for Video Crowd Counting.

Crowd Counting From A Still Image Using Multi-Scale Fully Convolutional Network With Adaptive Human-Shaped Kernel

Global Representation Guided Adaptive Fusion Network for Stable Video Crowd Counting

Multi-scale Dilated Convolution of Convolutional Neural Network for Crowd Counting

DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting

Scale and Background Aware Asymmetric Bilateral Network for Unconstrained Image Crowd Counting

An Improved Dilated Convolutional Network for Herd Counting in Crowded Scenes