Learning Conditional Prompt for Compositional Zero-Shot Learning

Tian Zhang, Kongming Liang, Ke Zhang, Zhanyu Ma
DOI: https://doi.org/10.1109/icme57554.2024.10688263
2024-01-01
Abstract: Compositional zero-shot learning (CZSL) strives to learn attributes and objects from seen compositions and transfer the acquired knowledge to unseen compositions. Existing methods either learn primitive concepts in an entangled manner, which leads the model to rely on spurious correlations between attributes and objects, or adopt a decoupled approach, which causes the model to overlook relationships between attributes and objects. In this paper, we propose a conditional prompting (CoP) method to enhance the performance of vision-language models (e.g., CLIP) in CZSL. Specifically, we utilize two image-to-word mapping networks to learn pseudo attribute and object word embeddings that represent the corresponding semantics of the input image. Subsequently, the model recognizes one concept conditioned on the pseudo word embedding generated for the other, enabling the recognition of individual sub-concepts while leveraging the image-specific correlations between attributes and objects. Experimental results on three CZSL benchmarks indicate that the proposed method achieves competitive performance compared to previous state-of-the-art methods.
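The conditioning idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the mean-pooling text encoder, the random linear mapping networks, the embedding size, and every name below (`conditional_scores`, `encode_prompt`, `attr_map`, `obj_map`) are assumptions made for illustration, standing in for a frozen vision-language model such as CLIP and for trained mapping networks.

```python
import math
import random

random.seed(0)
DIM = 8  # toy embedding size (assumption, far smaller than a real model)


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def encode_prompt(tokens):
    # Hypothetical stand-in for a frozen text encoder: mean-pool the token
    # embeddings, then L2-normalize.
    pooled = [sum(column) / len(tokens) for column in zip(*tokens)]
    return normalize(pooled)


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_scores(image_feat, condition_map, candidates):
    """Score candidate words for one primitive, conditioned on a pseudo word
    for the other primitive that is generated from the image itself."""
    pseudo_word = matvec(condition_map, image_feat)  # image-to-word mapping
    image = normalize(image_feat)
    logits = []
    for word_emb in candidates.values():
        text = encode_prompt([pseudo_word, word_emb])
        logits.append(sum(a * b for a, b in zip(image, text)))  # cosine
    return dict(zip(candidates, softmax(logits)))


# Toy vocabularies with random word embeddings (assumption).
attr_words = {name: rand_vec(DIM) for name in ("wet", "dry")}
obj_words = {name: rand_vec(DIM) for name in ("dog", "cat")}

# Two untrained mapping networks, one per primitive (assumption: linear).
attr_map = rand_mat(DIM, DIM)
obj_map = rand_mat(DIM, DIM)

image_feat = rand_vec(DIM)  # placeholder for an image-encoder output

# Recognize the object given the pseudo attribute word, and vice versa.
obj_probs = conditional_scores(image_feat, attr_map, obj_words)
attr_probs = conditional_scores(image_feat, obj_map, attr_words)
print(obj_probs, attr_probs)
```

In this sketch each primitive is scored separately, yet the prompt for one primitive always contains an image-specific pseudo word for the other, which is how the abstract describes retaining attribute-object correlations without entangling the two classifiers.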