Abstract: In this paper, we tackle the problem of video crowd counting. Compared with single-image crowd counting, video provides gradual spatial and temporal variation information that helps to strengthen the robustness of crowd counting. Therefore, it is critical to make full use of neighboring frames, both in feature extraction and in the final prediction, when estimating the count for the current frame. Based on these observations, we propose a motional foreground attention-based video crowd counting method. Specifically, we first leverage a foreground estimation module based on ConvNeXt to extract motional features from bidirectional frame differences and output a foreground estimation map. The motional features, combined with the static features of the current frame, are then sent into a feature fusion network, where the foreground estimation map is transformed into attention weights for crowd number estimation. Three new indoor video datasets are manually annotated. The proposed method achieves state-of-the-art performance on all indoor and outdoor video datasets.
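The bidirectional frame differences mentioned in the abstract can be illustrated with a minimal NumPy sketch. The function name `bidirectional_frame_differences`, the use of absolute differences, and the assumption of single-channel float frames are illustrative choices only; the abstract does not specify how the differences are computed or normalized before they reach the ConvNeXt-based module.

```python
import numpy as np

def bidirectional_frame_differences(prev_frame, cur_frame, next_frame):
    """Return (backward, forward) absolute differences around the current frame.

    Hypothetical helper: the paper's exact formulation is not given in the
    abstract, so absolute pixel differences are an assumption.
    """
    prev_frame = np.asarray(prev_frame, dtype=np.float32)
    cur_frame = np.asarray(cur_frame, dtype=np.float32)
    next_frame = np.asarray(next_frame, dtype=np.float32)
    if not (prev_frame.shape == cur_frame.shape == next_frame.shape):
        raise ValueError("all three frames must have the same shape")
    # Difference between the current frame and the previous one.
    backward = np.abs(cur_frame - prev_frame)
    # Difference between the next frame and the current one.
    forward = np.abs(next_frame - cur_frame)
    return backward, forward
```

Under these assumptions, static background pixels yield values near zero in both maps, while moving people produce nonzero responses in at least one of them.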

What problem does this paper attempt to address?

The paper addresses video crowd counting. Compared with crowd counting on a single image, video provides gradual spatial and temporal change information, which helps make crowd counting more robust. Making full use of adjacent frames in both feature extraction and final prediction is therefore crucial for estimating the count in the current frame. Based on these observations, the authors propose a video crowd counting method based on motion foreground attention. Specifically, the method first uses a ConvNeXt-based foreground estimation module to extract motion features from bidirectional frame differences and to output a foreground estimation map. The motion features are then combined with the static features of the current frame and sent to the feature fusion network, where the foreground estimation map is converted into attention weights for estimating the crowd number. In addition, the authors manually label three new indoor video datasets to better evaluate performance in indoor scenes.

The main contributions of the paper include:

1. Proposing the use of bidirectional frame differences to model the spatial and temporal correlations between adjacent frames, applied in both feature extraction and final prediction, thereby improving the robustness of the model.
2. Adding an up-sampling block to ConvNeXt for the video crowd counting task. As far as the authors know, this is the first time that ConvNeXt has been applied to video crowd counting.
3. Manually labeling three new video crowd counting datasets and providing point annotations.

The experimental results show that the proposed model outperforms existing state-of-the-art crowd counting methods on all indoor and outdoor video datasets.
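As a rough illustration of how a foreground estimation map might act as attention weights on fused features, consider the sketch below. It is not the authors' implementation: the sigmoid squashing, the `(C, H, W)` feature layout, and the names `sigmoid` and `apply_foreground_attention` are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping raw scores to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def apply_foreground_attention(features, foreground_logits):
    """Weight each spatial location of `features` by a foreground score.

    features:          array of shape (C, H, W), assumed fused features.
    foreground_logits: array of shape (H, W), assumed raw foreground scores.
    Hypothetical helper; the sigmoid is an assumed choice of transform.
    """
    features = np.asarray(features, dtype=np.float32)
    foreground_logits = np.asarray(foreground_logits, dtype=np.float32)
    if features.shape[1:] != foreground_logits.shape:
        raise ValueError("spatial sizes of features and foreground map differ")
    weights = sigmoid(foreground_logits)
    # Broadcast the (H, W) weights across all C channels.
    return features * weights[None, :, :]
```

With this kind of weighting, locations judged to be background are suppressed before the final count is regressed, while locations with strong motion evidence pass through largely unchanged.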

Motional foreground attention-based video crowd counting

Relevant Region Prediction for Crowd Counting

Counting moving people in crowds using motion statistics of feature-points

Multi-branch Progressive Embedding Network for Crowd Counting

Motion-guided Non-local Spatial-Temporal Network for Video Crowd Counting

A Dynamic-Attention On Crowd Region With Physical Optical Flow Features For Crowd Counting

Crowd counting method based on the self-attention residual network

Frame-Recurrent Video Crowd Counting

Leverage Multi-scale Dilated Convolutional Neural Network with Global Attention Feature Fusion for Crowd Counting

Spatial-Frequency Attention Network for Crowd Counting

Concise Convolutional Neural Network for Crowd Counting

Video Crowd Localization with Multifocus Gaussian Neighborhood Attention and a Large-Scale Benchmark

Recurrent Fine-Grained Self-Attention Network for Video Crowd Counting

FDCNet: Frontend-Backend Fusion Dilated Network Through Channel-Attention Mechanism

MLANet: multi-level attention network with multi-scale feature fusion for crowd counting

Multi-level Feature Fusion Based Locality-Constrained Spatial Transformer Network for Video Crowd Counting

Video Crowd Localization with Multi-focus Gaussian Neighbor Attention and a Large-Scale Benchmark

Dual-branch counting method for dense crowd based on self-attention mechanism

Heterogeneous Dual-Attentional Network for WiFi and Video-Fused Multi-modal Crowd Counting

Correlation-attention guided regression network for efficient crowd counting

DRENet: Giving Full Scope to Detection and Regression-Based Estimation for Video Crowd Counting