HumanFormer: Human-centric Prompting Multi-modal Perception Transformer for Referring Crowd Detection

Heqian Qiu, Lanxiao Wang, Taijin Zhao, Fanman Meng, Hongliang Li
DOI: https://doi.org/10.1109/cvprw63382.2024.00562
2024-01-01
Computer Vision and Pattern Recognition
Abstract: As an important step towards crowd understanding, referring crowd detection (RCD) aims to locate the person described by a natural language expression in crowded human environments. Existing methods rely on either ambiguous object-based or token-based features designed for general scene understanding. However, both ignore the diverse fine-grained human properties and complex relationships that are crucial for locating the target person among similar persons. In this paper, we propose a novel human-centric prompting multi-modal perception transformer (HumanFormer) to explicitly align fine-grained human concept information between the visual and language modalities for accurate referring crowd detection. Specifically, we introduce a human-centric prompt exporter to adaptively exploit various human-related part and attribute prompt representations with prior knowledge. Based on the part-level prompts, we then design a part-prompting multi-modal encoder that achieves fine-grained, focused cross-modal fusion within each part region to avoid interference from irrelevant regions. Furthermore, we leverage an attribute-prompting reasoning decoder that gradually infers the final object location through sequential interaction with the fine-grained attribute representation, language, and vision. Extensive experimental results on the challenging RefCrowd dataset, other general benchmarks, and the JRDB dataset demonstrate the effectiveness and generality of the proposed method.