LVAR-CZSL: Learning Visual Attributes Representation for Compositional Zero-Shot Learning
Xingjiang Ma,Jing Yang,Jiacheng Lin,Zhenzhe Zheng,Shaobo Li,Bingqi Hu,Xianghong Tang
DOI: https://doi.org/10.1109/tcsvt.2024.3444782
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract:Compositional Zero-Shot Learning (CZSL) has been applied to various scenarios, including scene understanding, visual-language representation, and domain adaptation. Despite numerous endeavours and significant advancements, the crucial issues of fuzzy conceptualization of visual attributes and insufficient inter-class connectivity, have remained insufficiently addressed. To address these issues, we propose Learning Visual Attributes Representation for Compositional Zero-Shot Learning (LVAR-CZSL), which has the ability to learn visual attributes and inter-class dependencies. LVAR-CZSL is mainly composed of two key components: the Visual Attribute Representation Module (VARM) and the Connected Learning Module (CLM). Specifically, VARM extracts detailed attributes and object visual features from global visual features, resolving the issue of fuzzy visual attribute concepts. Moreover, CLM endows LVAR-CZSL with the capability to perceive connectivity between different attributes and objects, effectively enhancing inter-class connectivity. To establish a close connection between VARM and CLM and minimize the gap between image and text features, we introduce the composition-attribute-object Joint Scoring Function (JSF). Additionally, we propose Joint Loss Function (JLF) to optimize the learning process of VARM and CLM. The experiment results on four datasets show that LVAR-CZSL achieves state-of-the-art performance. The code is available at https://github.com/mxjmxj1/LVAR-CZSL.