Semantic-guided RGB-Thermal Crowd Counting with Segment Anything Model

Yaqun Fang, Yi Shi, Jia Bei, Tongwei Ren
DOI: https://doi.org/10.1145/3652583.3658108
2024-01-01
Abstract: RGB-Thermal (RGB-T) crowd counting exploits the complementary nature of the visible-light and thermal modalities to count people accurately. In real-world scenes, however, background elements such as trees and lampposts are often misidentified as individuals, which leads to inaccurate counts. Existing methods use segmentation as a preliminary step, so their performance is constrained by segmentation accuracy. In this paper, we propose a novel method that uses the Segment Anything Model (SAM) to distinguish the foreground of an image from its background. Specifically, we first use SAM to obtain a semantic map of the original image. We then extract the modality features and semantic features corresponding to the RGB and thermal modalities through multimodal feature extraction. These features are fused by the Semantic-guide Feature Fusion module. Finally, the Multi-level Decoder generates the density map and the final count. Our approach achieves state-of-the-art performance on the RGBT-CC dataset.
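The abstract's final step, turning a density map into a count, and the motivation for separating foreground from background can be illustrated with a minimal NumPy sketch. This is not the authors' network: the function names, the Gaussian kernel parameters, and the hand-made binary mask are all assumptions made for illustration. The mask merely stands in for a SAM-derived semantic map, and multiplying the output by it is a deliberate simplification, whereas the paper fuses semantic features inside the model.

```python
import numpy as np


def gaussian_kernel(size: int = 7, sigma: float = 1.5) -> np.ndarray:
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()


def density_map(points, shape, size: int = 7, sigma: float = 1.5) -> np.ndarray:
    """Place one unit-mass Gaussian at each (row, col) location.

    Points closer than size // 2 pixels to the border lose some mass
    when the padding is cropped away; the demo below avoids that case.
    """
    h, w = shape
    r = size // 2
    padded = np.zeros((h + 2 * r, w + 2 * r))
    k = gaussian_kernel(size, sigma)
    for y, x in points:
        # In padded coordinates the kernel is centered at (y + r, x + r).
        padded[y:y + size, x:x + size] += k
    return padded[r:r + h, r:r + w]


def masked_count(density: np.ndarray, foreground_mask: np.ndarray) -> float:
    """Integrate the density map over foreground pixels only."""
    return float((density * foreground_mask).sum())


# Hypothetical scene: two people plus one false response on a background
# object (for example a lamppost) at column 26.
people = [(8, 8), (20, 12)]
false_response = [(10, 26)]
density = density_map(people + false_response, shape=(32, 32))

# Hypothetical foreground mask: columns 0..19 are foreground, the rest background.
mask = np.zeros((32, 32))
mask[:, :20] = 1.0

raw_count = float(density.sum())              # approximately 3.0
clean_count = masked_count(density, mask)     # approximately 2.0
print(raw_count, clean_count)
```

The count is the integral (sum) of the density map, so any density predicted on background objects inflates it directly; suppressing the background region removes the spurious contribution.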