Abstract: Crowd counting has received extensive attention in computer vision, and methods based on deep convolutional neural networks (CNNs) have made great progress on this task. However, challenges such as scale variation, nonuniform distribution, complex backgrounds, and occlusion in crowded scenes hinder the performance of these networks. To overcome these challenges, this article proposes a multiscale spatial guidance perception aggregation network (MGANet) for efficient and accurate crowd counting. MGANet consists of three parts: a multiscale feature extraction network (MFEN), a spatial guidance network (SGN), and an attention fusion network (AFN). Specifically, to alleviate the scale variation problem in crowded scenes, MFEN is introduced to enhance scale adaptability and effectively capture multiscale features in scenes with drastic scale variation. To address nonuniform crowd distribution and complex backgrounds, an SGN is proposed. The SGN includes two parts: the spatial context network (SCN) and the guidance perception network (GPN). SCN captures the detailed semantic information between the positions of the multiscale features extracted by MFEN and improves the exploration of deep structured information. It also establishes dependencies between spatially distant contexts to enlarge the receptive field. GPN enhances the information exchange between channels and guides the network to select appropriate multiscale features and spatial context semantic features. AFN adaptively measures the importance of these different features and obtains accurate and effective feature representations from them. In addition, this article proposes a novel region-adaptive loss function, which optimizes the regions of the image with large recognition errors and alleviates the inconsistency between the training target and the evaluation metric.
To evaluate the proposed method, extensive experiments were carried out on challenging benchmarks including ShanghaiTech Part A and Part B, UCF-CC-50, UCF-QNRF, and JHU-CROWD++. Experimental results show that the proposed method performs well on all four datasets. In particular, on the ShanghaiTech Part A and Part B, UCF-QNRF, and JHU-CROWD++ datasets, the proposed method achieves superior recognition performance and better robustness than state-of-the-art methods.
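
The abstract does not give the form of the region-adaptive loss, so the following is only an illustrative sketch of the general idea it describes: compare predicted and ground-truth counts region by region, then give extra weight to the regions with the largest count errors. The grid split, the `top_k` selection, the `boost` factor, and all function names are assumptions made for this example and are not taken from the paper.

```python
from typing import List

Grid = List[List[float]]  # a density map stored as rows of floats


def region_counts(density: Grid, rows: int, cols: int) -> List[float]:
    """Sum a density map over a rows x cols grid of rectangular regions."""
    h, w = len(density), len(density[0])
    counts = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            counts.append(
                sum(density[y][x] for y in range(y0, y1) for x in range(x0, x1))
            )
    return counts


def region_weighted_count_loss(
    pred: Grid, gt: Grid, rows: int = 2, cols: int = 2,
    top_k: int = 1, boost: float = 2.0,
) -> float:
    """Weighted mean of per-region absolute count errors.

    Assumption for illustration: the top_k regions with the largest
    errors receive weight `boost`; all other regions receive weight 1.
    """
    errors = [
        abs(p - g)
        for p, g in zip(region_counts(pred, rows, cols),
                        region_counts(gt, rows, cols))
    ]
    hardest = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)[:top_k]
    weights = [boost if i in hardest else 1.0 for i in range(len(errors))]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


if __name__ == "__main__":
    gt = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
    pred = [[0.5, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 3]]
    print(region_weighted_count_loss(pred, gt))
```

Working on region counts rather than per-pixel differences keeps the training signal closer to count-based evaluation metrics such as mean absolute error, which is one plausible reading of the inconsistency the abstract mentions.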

Video Crowd Localization with Multifocus Gaussian Neighborhood Attention and a Large-Scale Benchmark

Video Crowd Localization with Multi-focus Gaussian Neighbor Attention and a Large-Scale Benchmark

Multi-branch Progressive Embedding Network for Crowd Counting

Crowd Counting Based on Multiscale Spatial Guided Perception Aggregation Network

A Crowd Counting and Localization Network Based on Adaptive Feature Fusion and Multi-Scale Global Attention Up Sampling

Leverage Multi-Scale Dilated Convolutional Neural Network with Global Attention Feature Fusion for Crowd Counting

Motion-guided Non-local Spatial-Temporal Network for Video Crowd Counting

Beyond Counting: Point Supervised Attention Guided Neural Network for Crowded Object Locating

Multi-level Feature Fusion Based Locality-Constrained Spatial Transformer Network for Video Crowd Counting

Motional foreground attention-based video crowd counting

Spatial-Frequency Attention Network for Crowd Counting

Multi-Person Gaze-Following with Numerical Coordinate Regression

Recurrent Attentive Zooming for Joint Crowd Counting and Precise Localization

Locality-constrained Spatial Transformer Network for Video Crowd Counting

3D Crowd Counting via Geometric Attention-guided Multi-View Fusion

MLANet: multi-level attention network with multi-scale feature fusion for crowd counting

An Adaptive Multi-Scale Network Based on Depth Information for Crowd Counting

Multi-Scale Guided Attention Network for Crowd Counting

SGCNet: Scale-aware and global contextual network for crowd counting

Dual Path Multi-Scale Fusion Networks with Attention for Crowd Counting

A Dynamic-Attention On Crowd Region With Physical Optical Flow Features For Crowd Counting