CLDE-Net: crowd localization and density estimation based on CNN and transformer network
Yaocong Hu, Yuanyuan Lin, Huicheng Yang, Bingyou Liu, Guoyang Wan, Jinwen Hong, Chao Xie, Wei Wang, Xiaobo Lu
DOI: https://doi.org/10.1007/s00530-024-01318-8
IF: 3.9
2024-04-10
Multimedia Systems
Abstract: Given a crowd image, a human can estimate the number of people in two ways: by exactly locating head points in each local region, or by directly estimating the total number of people from the whole image. Mirroring these two modes of human visual perception, CNNs and transformers are the two mainstream models for crowd counting: a CNN has a strong ability to extract locality-oriented features, while a transformer is well suited to modeling global dependencies. Based on this observation, the proposed CLDE-Net is the first study to combine exact localization and direct estimation in a hybrid of CNN and transformer. Specifically, the CNN branch searches for all candidate head points in each local region, and the transformer branch learns the crowd density map with a global receptive field. We further adopt two components to boost crowd counting performance: (1) a cross-layer feature interaction module that facilitates information transmission between the CNN and transformer branches, and (2) a dynamic factor generator that adaptively fuses the results of head point localization and density map estimation. Extensive experiments show that the proposed CLDE-Net framework achieves state-of-the-art performance on multiple crowd counting datasets.
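The fusion of the two branch outputs can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the dynamic factor is computed or applied, so the sketch assumes a simple convex combination and takes the factor as a plain input rather than a learned value. The function names, the score threshold, and the `(x, y, score)` point format are all hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical point format: (x, y, confidence score) per candidate head.
Point = Tuple[float, float, float]


def count_from_points(points: Sequence[Point], threshold: float = 0.5) -> int:
    """Localization-style count: candidate head points scoring at least `threshold`."""
    return sum(1 for _, _, score in points if score >= threshold)


def count_from_density(density_map: Sequence[Sequence[float]]) -> float:
    """Density-style count: the sum of the density map over all pixels."""
    return sum(sum(row) for row in density_map)


def fuse_counts(loc_count: float, density_count: float, factor: float) -> float:
    """Blend the two counts with a factor in [0, 1].

    Assumption: a convex combination. In the paper the factor comes from a
    learned dynamic factor generator; here it is simply passed in.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must lie in [0, 1]")
    return factor * loc_count + (1.0 - factor) * density_count


# Toy inputs, invented for illustration only.
points = [(10.0, 12.0, 0.9), (40.0, 8.0, 0.7), (22.0, 30.0, 0.2)]
density = [[0.5, 0.5], [1.0, 1.0]]

loc = count_from_points(points)      # two points pass the 0.5 threshold
dens = count_from_density(density)   # the map sums to 3.0
fused = fuse_counts(loc, dens, factor=0.5)
```

With a factor near 1 the fused count follows the localization branch, and with a factor near 0 it follows the density branch. Varying such a factor per image is one plausible reading of "adaptively fuse".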
computer science, information systems, theory & methods