Application of CLIP for Efficient Zero-Shot Learning

Hairui Yang,Ning Wang,Haojie Li,Lei Wang,Zhihui Wang
DOI: https://doi.org/10.1007/s00530-024-01414-9
IF: 3.9
2024-01-01
Multimedia Systems
Abstract:Zero-shot learning (ZSL) addresses the challenging task of recognizing classes absent during training. Existing methodologies focus on knowledge transfer from known to unknown categories by formulating a correlation between visual and semantic spaces. However, these methods are faced with constraints related to the discrimination of visual features and the integrity of semantic representations. To alleviate these limitations, we propose a novel Collaborative learning Framework for Zero-Shot Learning (CFZSL), which integrates the CLIP architecture into a fundamental zero-shot learner. Specifically, the foundational zero-shot learning model extracts visual features through a set of CNNs and maps them to a domain-specific semantic space. Simultaneously, the CLIP image encoder extracts visual features containing universal semantics. In this way, the CFZSL framework can obtain discriminative visual features for both domain-specific and domain-agnostic semantics. Additionally, a more comprehensive semantic space is explored by combining the latent feature space learned by CLIP and the domain-specific semantic space. Notably, we just leverage the pre-trained parameters of the CLIP model, mitigating the high training cost and potential overfitting issues associated with fine-tuning. Our proposed framework, characterized by its simple structure, undergoes training exclusively via classification and triplet loss functions. Extensive experimental results, conducted on three widely recognized benchmark datasets-AwA2, CUB, and SUN, conclusively affirm the effectiveness and superiority of our proposed approach.
What problem does this paper attempt to address?