CLFormer: a unified transformer-based framework for weakly supervised crowd counting and localization

Deng, Mingfang,Zhao, Huailin,Gao, Ming
DOI: https://doi.org/10.1007/s00371-023-02831-z
2023-04-13
Abstract:Recent progress in crowd counting and localization methods mainly relies on expensive point-level annotations and convolutional neural networks with limited receptive filed, which hinders their applications in complex real-world scenes. To this end, we present CLFormer, a Transformer-based weakly supervised crowd counting and localization framework. The model extracts global information from the input image using a Transformer and then passes the extracted features to both a regression branch for crowd counting and a localization branch for localization. Initial proposals are produced by the localization branch and filtered via score maps generated from the extracted features, and their centers are used as pseudo-point-level annotations. Through staggered training of the two branches, the quality of pseudo-point-level annotations is improved, and the final localization maps are generated. Experiments on four benchmark datasets (i.e., ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd) demonstrate that CLFormer obtains better counting performance than weakly supervised and fully supervised counting networks and comparable localization performance to fully supervised localization networks.
computer science, software engineering
What problem does this paper attempt to address?