Multimodal Crowd Counting with Mutual Attention Transformers

Zhengtao Wu, Lingbo Liu, Yang Zhang, Mingzhi Mao, Liang Lin, Guanbin Li
DOI: https://doi.org/10.1109/icme52920.2022.9859777
2022-01-01
Abstract: Crowd counting is a fundamental yet challenging task that aims to automatically estimate the number of people in crowded scenes. With the rapid development of thermal and depth sensors, thermal images and depth maps have become more accessible, and they have been shown to improve crowd counting performance. We therefore propose a Mutual Attention Transformer (MAT) module to fully leverage the complementary information of different modalities. Specifically, MAT employs a cross-modal mutual attention mechanism that uses the features of one modality to enhance the features of the other. Moreover, to learn better visual representations and further exploit modality-wise complementarity, we design a self-supervised pre-training method based on cross-modal image reconstruction. Extensive experiments on two standard benchmarks (i.e., RGBT-CC and ShanghaiTechRGBD) show that the proposed method is effective and universal for multimodal crowd counting, outperforming previous state-of-the-art methods.
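The cross-modal mutual attention idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' MAT implementation: it assumes a single attention head, token features of shape (tokens, channels), a plain residual connection, and no normalization or feed-forward layers. The function names (`cross_attention`, `mutual_attention`) and the weight matrices are hypothetical and introduced here only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Enhance query_feats with information attended from context_feats.

    Hypothetical single-head layer: queries come from one modality,
    keys and values come from the other modality.
    """
    q = query_feats @ w_q            # (n_query, d)
    k = context_feats @ w_k          # (n_context, d)
    v = context_feats @ w_v          # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    # Residual connection (an assumption of this sketch).
    return query_feats + weights @ v


def mutual_attention(rgb_feats, aux_feats, params_rgb, params_aux):
    """Apply cross-attention in both directions (RGB <-> thermal/depth)."""
    rgb_out = cross_attention(rgb_feats, aux_feats, *params_rgb)
    aux_out = cross_attention(aux_feats, rgb_feats, *params_aux)
    return rgb_out, aux_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    rgb = rng.standard_normal((6, d))   # 6 RGB tokens
    aux = rng.standard_normal((6, d))   # 6 thermal or depth tokens
    p_rgb = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
    p_aux = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
    rgb_out, aux_out = mutual_attention(rgb, aux, p_rgb, p_aux)
    print(rgb_out.shape, aux_out.shape)
```

Each modality acts as the query in one direction and as the key/value source in the other, which is one plausible reading of "utilizing the features of one modality to enhance the features of the other".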