CrowdCaption++: Collective-guided Crowd Scenes Captioning

Lanxiao Wang, Hongliang Li, Minjian Zhang, Heqian Qiu, Fanman Meng, Qingbo Wu, Linfeng Xu
DOI: https://doi.org/10.1109/tmm.2023.3328189
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Crowd scenes analysis plays an important role in various fields, including public security, smart cities, and intelligent transportation systems. However, traditional crowd scenes captioning methods mainly focus on a single, prominent crowd collective, which limits their ability to describe the different crowd collectives in complex crowd scenes. To address this issue, we propose a collective-guided crowd scenes captioning model (CrowdCaption++) that produces more comprehensive and detailed descriptions. We design a crowd features encoder (CFE) comprising a double-query features encoder and a foreground crowd features encoder. The CFE uses a double-query attention module (DQ-ATT) to capture more representative visual features, and extracts foreground crowd features to avoid background interference in collective prediction. Moreover, we build a collective-guided captioning decoder (CCD) to generate captions for different crowd collectives without requiring extra alignment between crowd collectives and captions. To achieve this, we first design a crowd collectives predictor that identifies multiple potential crowd collectives and creates crowd collectives guidance information. Finally, we use this guidance information to merge useful visual features and generate the corresponding caption. We evaluate our approach on the latest crowd scenes dataset, CrowdCaption, and demonstrate that our model achieves a comprehensive understanding of complex crowd scenes and can describe their different crowd collectives.
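The last step in the abstract, where guidance information merges visual features for each collective, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `merge_by_guidance`, the representation of guidance as per-region weights, and the weighted-mean pooling are all assumptions made for illustration, since the abstract does not specify how the merge is performed.

```python
from typing import List


def merge_by_guidance(
    features: List[List[float]],
    guidance: List[List[float]],
) -> List[List[float]]:
    """Hypothetical merge: one guidance-weighted mean feature per collective.

    features: one vector per visual region (all of the same length).
    guidance: one row per predicted collective; each row holds a
              non-negative weight for every region (an assumed format).
    """
    dim = len(features[0])
    merged = []
    for weights in guidance:
        total = sum(weights)
        if total == 0:
            # No region belongs to this collective: fall back to zeros.
            merged.append([0.0] * dim)
            continue
        merged.append([
            sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)
        ])
    return merged


# Three regions with 2-D features, two predicted collectives.
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
guidance = [[1.0, 1.0, 0.0],   # collective 0 covers regions 0 and 1
            [0.0, 0.0, 1.0]]   # collective 1 covers region 2
merged = merge_by_guidance(features, guidance)
```

In this toy setup each merged vector would then condition a separate caption, which mirrors the abstract's idea of describing several collectives without explicit collective-to-caption alignment.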