Characterizing Hierarchical Semantic-Aware Parts with Transformers for Generalized Zero-Shot Learning

Peng Zhao, Xiaoming Xi, Qiangchang Wang, Yilong Yin
DOI: https://doi.org/10.1109/tcsvt.2024.3422491
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: This paper presents a novel Transformer architecture for zero-shot learning (ZSL), termed TransZSL, which characterizes hierarchical semantic-aware parts. It consists of three components: adaptive token refinement (ATR), hierarchical token aggregation (HTA), and semantic-aware prototypes (SAP). Firstly, a ViT is used as the backbone, providing comprehensive local information without missing details. To address the varying degrees of noise caused by large appearance variations, the ATR adaptively highlights important tokens and suppresses useless ones. However, because image structure is complex, some important tokens may be incorrectly discarded. A random perturbation is therefore proposed to reactivate discarded tokens at random, reducing the risk of missing discriminative information. Secondly, dataset descriptions contain both low- and high-level attributes. To capture both, the HTA aggregates complementary hierarchical tokens from multiple ViT layers. Thirdly, semantically similar content may be distributed across different tokens. To overcome this issue, the SAP groups semantically identical tokens into one prototype, focusing on semantic-aware parts. In addition, a diversity loss encourages the network to learn diverse prototypes that discover diverse parts. Both qualitative and quantitative results on several challenging tasks demonstrate the usefulness and effectiveness of the proposed methods.
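The abstract does not specify how token refinement, random reactivation, or the diversity loss are computed. The following is only a minimal sketch of the general ideas, under stated assumptions: tokens are ranked by a hypothetical per-token importance score, a fixed fraction is kept, each discarded token is reactivated with a small probability, and diversity is measured as the mean squared cosine similarity between prototype vectors. The names `refine_tokens`, `diversity_penalty`, `keep_ratio`, and `reactivate_prob` are illustrative and do not come from the paper.

```python
import math
import random


def refine_tokens(scores, keep_ratio=0.5, reactivate_prob=0.1, rng=None):
    """Return the sorted indices of tokens kept after refinement.

    scores: one importance score per token (assumed given; the paper's
    actual scoring rule is not described in the abstract).
    """
    rng = rng or random.Random()
    n = len(scores)
    n_keep = max(1, math.ceil(keep_ratio * n))
    # Rank token indices from most to least important.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    # Random perturbation (assumed form): each discarded token is
    # reactivated independently with probability reactivate_prob.
    for i in ranked[n_keep:]:
        if rng.random() < reactivate_prob:
            kept.add(i)
    return sorted(kept)


def diversity_penalty(prototypes):
    """Mean squared cosine similarity over all pairs of prototypes.

    Lower values mean more diverse prototypes. This is one common way to
    express a diversity term, not necessarily the loss used in the paper.
    Assumes at least two non-zero prototype vectors.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    total, pairs = 0.0, 0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            total += cosine(prototypes[i], prototypes[j]) ** 2
            pairs += 1
    return total / pairs
```

With `reactivate_prob=0.0` the refinement reduces to plain top-k selection, and with orthogonal prototypes the penalty is zero, which makes the two knobs easy to sanity-check in isolation.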