Heterogeneous Dual-Attentional Network for WiFi and Video-Fused Multi-modal Crowd Counting

Lifei Hao, Baoqi Huang, Bing Jia, Guoqiang Mao
DOI: https://doi.org/10.1109/tmc.2024.3444469
IF: 6.075
2024-01-01
IEEE Transactions on Mobile Computing
Abstract: Crowd counting aims to estimate the number of individuals in targeted areas. However, mainstream vision-based methods suffer from limited coverage and difficulty in multi-camera collaboration, which limits their scalability, whereas emerging WiFi-based methods can only obtain coarse results due to signal randomness. To overcome the inherent limitations of unimodal approaches and effectively exploit the advantages of multi-modal approaches, this paper presents a WiFi and video-fused multi-modal paradigm built on a heterogeneous dual-attentional network, which jointly models the intra- and inter-modality relationships of global WiFi measurements and local videos to achieve accurate and stable large-scale crowd counting. First, a flexible hybrid sensing network is constructed to capture synchronized multi-modal measurements characterizing the same crowd at different scales and perspectives; second, differential preprocessing, heterogeneous feature extractors, and self-attention mechanisms are applied in sequence to extract and optimize modality-independent and crowd-related features; third, a cross-attention mechanism is employed to deeply fuse and generalize the matching relationships between the two modalities. Extensive real-world experiments demonstrate that, compared to the best WiFi unimodal baseline, the method reduces the error by $26.2\%$, improves the stability by $48.43\%$, and achieves an accuracy of about $88\%$ in large-scale crowd counting when videos from two cameras are included.
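The abstract describes self-attention within each modality followed by cross-attention between modalities. The snippet below is only a minimal sketch of that general pattern using plain scaled dot-product attention in NumPy; it is not the authors' network. The feature shapes, the random inputs, the absence of learned projection weights, and the choice of WiFi features as queries over video features are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (output, weights)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
# Hypothetical extracted features: 6 WiFi tokens and 4 video tokens, 16-dim each.
wifi_feats = rng.normal(size=(6, 16))
video_feats = rng.normal(size=(4, 16))

# Intra-modality step: each modality attends to itself.
wifi_self, _ = attention(wifi_feats, wifi_feats, wifi_feats)
video_self, _ = attention(video_feats, video_feats, video_feats)

# Inter-modality step (assumed direction): WiFi queries attend to video keys/values.
fused, cross_weights = attention(wifi_self, video_self, video_self)

print(fused.shape, cross_weights.shape)
```

Each row of `cross_weights` is a probability distribution over the video tokens, so `fused` re-expresses every WiFi token as a weighted mix of video features. A real model would add learned query, key, and value projections and a regression head for the count.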