Encoder-Decoder Network Fusing Channel and Spatial Attention for Crowd Counting

YU Ying, PAN Cheng, ZHU Huilin, QIAN Jin, TANG Hong
DOI: https://doi.org/10.3778/j.issn.1673-9418.2104122
2022-01-01
Abstract: Crowd counting aims to accurately predict the number, distribution, and density of people in real scenes. In practice, complex backgrounds, widely varying target scales, and cluttered crowd distributions substantially reduce counting precision. To address these problems, a channel and spatial attention-based encoder-decoder network for crowd counting (CSANet) is proposed. It uses a multi-level encoder-decoder network to extract multi-scale semantic features and fully integrates spatial context information, which addresses pedestrian scale variation and cluttered distribution in complex scenes. To reduce the impact of complex backgrounds on counting performance, channel and spatial attention are introduced into the feature fusion process. They improve the quality of the crowd density map by increasing the feature weights of crowd regions to highlight regions of interest and by decreasing the feature weights of weakly correlated background regions to suppress background noise. To verify the effectiveness of the proposed algorithm, experiments are conducted on several classical crowd counting datasets. The results show that, compared with existing crowd counting algorithms, CSANet performs well in multi-scale feature extraction and background noise suppression, which greatly improves the accuracy and robustness of counting in dense scenes.
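The reweighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how CSANet's attention modules are built, so everything below is an assumption made for illustration: the gates are parameter-free (a sigmoid of an average) rather than learned, the feature map is a plain nested list of shape [channels][height][width], and the function names `channel_attention`, `spatial_attention`, and `fuse_with_attention` are hypothetical. It is not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Scale each channel by a gate derived from its global average.

    Assumption: a parameter-free gate stands in for a learned one.
    """
    out = []
    for channel in feat:
        values = [v for row in channel for v in row]
        gate = sigmoid(sum(values) / len(values))
        out.append([[v * gate for v in row] for row in channel])
    return out


def spatial_attention(feat):
    """Scale each location by a gate derived from the mean across channels."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    gates = [
        [sigmoid(sum(feat[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]
    return [
        [[feat[k][i][j] * gates[i][j] for j in range(w)] for i in range(h)]
        for k in range(c)
    ]


def fuse_with_attention(feat):
    """Apply channel attention, then spatial attention (order is an assumption)."""
    return spatial_attention(channel_attention(feat))


# Toy feature map: channel 0 responds strongly at the "crowd" pixel (0, 0),
# channel 1 is a weak, uniform response.
toy = [
    [[4.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
fused = fuse_with_attention(toy)
ratio_before = toy[0][0][0] / toy[0][0][1]
ratio_after = fused[0][0][0] / fused[0][0][1]
print(ratio_before, ratio_after)
```

In this toy case the weak channel receives a smaller gate than the strong one, and the contrast between the high-response location and its neighbours grows after fusion, which is the qualitative effect the abstract attributes to attention: crowd regions are emphasized and weakly correlated background is damped.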