Abstract: Lightweight crowd-counting networks have become the mainstream way to deploy crowd-counting techniques on resource-constrained devices. Significant progress has been made in this field, with many strong lightweight models proposed in succession. However, challenges such as scale variation, global feature extraction, and the need for fine-grained head annotations remain, and call for further improvement. In this article, we propose a weakly supervised hybrid lightweight crowd-counting network that uses the initial layers of GhostNet as its backbone to efficiently extract local features and enrich intermediate features. A modified Swin-Transformer block supplies global context information. A Pyramid Pooling Aggregation Module handles the scale variation inherent in crowd-counting tasks in a more computation-efficient way. This module and the cross-attention module act as bridges that promote the flow of information between local features and global context. Finally, a simplified regressor module lets the proposed model be trained under weak supervision, avoiding the need for precise location-level annotations; omitting density map generation also makes the network more lightweight. On the UCF-QNRF dataset, our model improves MAE and MSE by 8.73% and 12.17%, respectively, over the second-best ARNet, while using 4.52% fewer parameters. On the ShanghaiTech A dataset, MAE and MSE drop by 1.5% and 3.2%, respectively, compared to the second-best PDDNet. Accuracy and inference-speed evaluations on several mainstream datasets validate the design principles of our model.
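The reported percentages are relative changes in count-level error metrics. The following is a minimal sketch of how such figures are typically computed, not the authors' evaluation code. It assumes the common crowd-counting convention in which "MSE" denotes the root of the mean squared count error; the abstract does not state this explicitly. The function names and the sample counts are hypothetical and chosen only for illustration.

```python
import math

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    n = len(true_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n

def mse(pred_counts, true_counts):
    """Root of the mean squared count error (assumed crowd-counting convention)."""
    n = len(true_counts)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / n)

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Hypothetical per-image counts, not taken from any dataset.
true_counts = [100, 250, 40]
pred_counts = [110, 240, 45]

example_mae = mae(pred_counts, true_counts)
example_mse = mse(pred_counts, true_counts)
# A hypothetical baseline error of 100.0 against 91.27 gives a relative reduction of about 8.73%.
example_gain = relative_reduction(100.0, 91.27)
```

Because only image-level counts enter these metrics, they can be evaluated for a weakly supervised model that never produces a density map.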

CLFormer: a unified transformer-based framework for weakly supervised crowd counting and localization

CLDE-Net: crowd localization and density estimation based on CNN and transformer network

Multi-branch Progressive Embedding Network for Crowd Counting

DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting

Transformer-CNN Hybrid Network for Crowd Counting

TransCrowd: weakly-supervised crowd counting with transformers

An interactive network based on transformer for multimodal crowd counting

Leverage Multi-Scale Dilated Convolutional Neural Network with Global Attention Feature Fusion for Crowd Counting

A Crowd Counting and Localization Network Based on Adaptive Feature Fusion and Multi-Scale Global Attention Up Sampling

CCTrans: Simplifying and Improving Crowd Counting with Transformer

A Weakly Supervised Hybrid Lightweight Network for Efficient Crowd Counting

Concise Convolutional Neural Network for Crowd Counting

CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model

Adaptive Context Learning Network for Crowd Counting

Hierarchical Inverse Distance Transformer for Enhanced Localization in Dense Crowds

CLRNet: A Cross Locality Relation Network for Crowd Counting in Videos

Learning Discriminative Features for Crowd Counting

Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework

Cross-scale Vision Transformer for crowd localization

A Dilated Convolutional Neural Network for Cross-Layers of Contextual Information for Congested Crowd Counting