Towards zero-shot object counting via deep spatial prior cross-modality fusion
Jinyong Chen,Qilei Li,Mingliang Gao,Wenzhe Zhai,Gwanggil Jeon,David Camacho
DOI: https://doi.org/10.1016/j.inffus.2024.102537
IF: 18.6
2024-06-20
Information Fusion
Abstract: Existing counting models predominantly operate on a specific category of objects, such as crowds or vehicles. The recent emergence of multi-modal foundation models, e.g., Contrastive Language-Image Pre-training (CLIP), has facilitated class-agnostic counting, which involves counting objects of any given class in a single image based on textual instructions. However, CLIP-based class-agnostic counting models face two primary challenges. Firstly, the CLIP model lacks sensitivity to location information: it generally considers global content rather than the fine-grained location of objects, so adapting it directly is suboptimal. Secondly, these models often freeze the pre-trained vision and language encoders while neglecting the potential misalignment in the constructed hypothesis space. In this paper, we address these two issues in a unified framework termed the Deep Spatial Prior Interaction (DSPI) network. DSPI leverages the spatial-awareness ability of large-scale pre-trained object grounding models, i.e., Grounding DINO, to incorporate spatial location as an additional prior for a specific query class. This enables the network to focus more closely on the precise location of the objects. Additionally, to align the feature space across different modalities, we tailor a meta adapter that distills textual information into an object query, which serves as an instruction for cross-modality matching. These two modules collaboratively ensure the alignment of multi-modal representations while preserving their discriminative nature. Comprehensive experiments conducted on a diverse set of benchmarks verify the superiority of the proposed model. The code is available at https://github.com/jinyongch/DSPI.
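Illustration: the idea of using spatial location as an additional prior can be pictured with a toy sketch. It is not the DSPI architecture, in which the fusion is learned; the sketch assumes, purely for illustration, that a grounding model has already returned pixel boxes for the query class and that a per-pixel response map is available. The names `box_prior`, `apply_prior`, and `total` are hypothetical and do not come from the paper or its repository.

```python
def box_prior(height, width, boxes):
    """Build a binary spatial prior: 1.0 inside any box, 0.0 elsewhere.

    Boxes are (x0, y0, x1, y1) in pixel units, half-open on the right
    and bottom edges. Overlapping boxes are not double counted.
    """
    mask = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1.0
    return mask


def apply_prior(response, prior):
    """Gate a response map elementwise with the spatial prior (assumed fusion rule)."""
    return [
        [r * p for r, p in zip(response_row, prior_row)]
        for response_row, prior_row in zip(response, prior)
    ]


def total(grid):
    """Sum all cells; for a density-style map this is the count estimate."""
    return sum(sum(row) for row in grid)


if __name__ == "__main__":
    # Hypothetical 4x4 response map with a uniform value of 0.5.
    response = [[0.5] * 4 for _ in range(4)]
    prior = box_prior(4, 4, [(0, 0, 2, 2)])
    gated = apply_prior(response, prior)
    print(total(response), total(gated))
```

In this toy setting the prior suppresses responses outside the grounded region, which mirrors the abstract's motivation that a globally oriented representation benefits from explicit location cues.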
computer science, artificial intelligence, theory & methods