Discriminative Feature Focus Via Masked Autoencoder for Zero-Shot Learning

Jingqi Yang, Cheng Xie, Peng Tang
DOI: https://doi.org/10.1109/cscwd57460.2023.10152773
2023-01-01
Abstract: Zero-shot learning (ZSL) is an important research area in computer-supported cooperative work in design, especially in the field of visual collaborative computing. ZSL normally uses transferable semantic features to represent visual features, so that unseen classes can be predicted without training on any samples from those classes. Existing ZSL models have attempted to learn region features in a single image, while the discriminative attribute localization of visual features is typically neglected. To address this problem, we propose a zero-shot learning model based on a pre-trained Masked Autoencoder (MAE). It uses multi-head self-attention in Transformer blocks to capture the most discriminative local features from a partial perspective by considering both the positional and contextual information of the entire sequence of patches, which is consistent with the human attention mechanism when recognizing objects. Further, it uses a Multilayer Perceptron (MLP) to map visual features to the semantic space, relating visual and semantic attributes, and predicts the semantic information that is used to find the class label during inference. Both quantitative and qualitative experimental results on three popular ZSL benchmarks show that the proposed method achieves new state-of-the-art results in both generalized and conventional zero-shot learning. The source code of the proposed method is available at https://github.com/yangjingqi99/MAE-ZSL
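
The inference step described in the abstract (an MLP maps visual features into the semantic space, and the predicted semantic vector is then matched to a class label) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mlp_forward`, `cosine`, and `predict_class` are hypothetical, the use of cosine similarity for matching is an assumption (the abstract does not name a similarity measure), and the three-dimensional class attribute vectors are made-up toy values.

```python
import math


def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU: maps a visual feature vector to the attribute space.

    Assumed architecture for illustration only; weights are lists of rows.
    """
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def predict_class(semantic_pred, class_attributes):
    """Return the class whose attribute vector is most similar to the prediction."""
    return max(
        class_attributes,
        key=lambda name: cosine(semantic_pred, class_attributes[name]),
    )


# Toy attribute vectors (hypothetical): [has_hooves, has_claws, has_stripes].
class_attributes = {
    "zebra": [1.0, 0.0, 1.0],
    "horse": [1.0, 0.0, 0.0],
    "tiger": [0.0, 1.0, 1.0],
}

# Pretend this vector came out of the MLP for an image of an unseen class.
label = predict_class([0.9, 0.1, 0.8], class_attributes)
```

Because classes are identified only by their attribute vectors, a class that never appeared during training can still be predicted, provided its attribute vector is available at inference time.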