Abstract: Generative models (e.g., generative adversarial networks (GANs)) have advanced zero-shot learning (ZSL). Generative ZSL methods typically synthesize visual features of unseen classes, conditioned on predefined class semantic prototypes, to compensate for the lack of unseen-class samples. Because these empirically designed prototypes do not faithfully represent the actual semantic prototypes of visual features (i.e., visual prototypes), existing methods are limited in their ability to synthesize visual features that accurately represent real features and prototypes. We formulate this phenomenon as a visual-semantic domain shift problem, which prevents generative models from further improving ZSL performance. In this paper, we propose a dynamic semantic prototype learning (DSP) method that aligns the empirical and actual semantic prototypes in order to synthesize accurate visual features. The alignment is performed by jointly refining semantic prototypes and visual features so that the generator synthesizes visual features close to the real ones. We use a visual→semantic mapping network (V2SM) to map both synthesized and real features into the class semantic space, which helps the generator synthesize visual representations with rich semantics. The real and synthesized visual features supervise our visual-oriented semantic prototype evolving network (VOPE), in which the predefined class semantic prototypes are iteratively evolved into dynamic semantic prototypes. These prototypes are then fed back to the generative network as conditional supervision. Finally, we enhance visual features by fusing the evolved semantic prototypes into their corresponding visual features.
Extensive experiments on three benchmark datasets show that DSP improves existing generative ZSL methods. For example, averaged over four baselines (CLSWGAN, f-VAEGAN, TF-VAEGAN and FREE), the harmonic mean improves by 8.5%, 8.0% and 9.7% on CUB, SUN and AWA2, respectively.
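The harmonic mean reported above can be illustrated with a minimal sketch. It assumes the usual generalized ZSL convention, in which the harmonic mean H combines top-1 accuracy on seen classes (S) and unseen classes (U) as H = 2SU / (S + U); the abstract itself does not spell out the formula. The function name and the accuracy values below are hypothetical and are not results from the paper.

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """Harmonic mean of seen-class and unseen-class accuracy.

    Assumes the common generalized ZSL definition H = 2*S*U / (S + U).
    Returns 0.0 when both accuracies are zero to avoid division by zero.
    """
    total = acc_seen + acc_unseen
    if total == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / total


if __name__ == "__main__":
    # Hypothetical accuracies (in percent), chosen only for illustration.
    seen_acc, unseen_acc = 60.0, 40.0
    print(f"H = {harmonic_mean(seen_acc, unseen_acc):.1f}")
```

Because the harmonic mean is dominated by the smaller of the two accuracies, a method that scores well on seen classes but poorly on unseen classes obtains a low H, which is why it is a common summary metric when both class groups appear at test time.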

DSPformer: Discovering Semantic Parts with Token Growth and Clustering for Zero-Shot Learning

Generating Manifold-Aligned Semantic Feature for Zero-Shot Learning

Characterizing Hierarchical Semantic-Aware Parts with Transformers for Generalized Zero-Shot Learning

Transformer-Based Approach Via Contrastive Learning for Zero-Shot Detection

DSP: Dynamic Semantic Prototype for Generative Zero-Shot Learning

Temporal–Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning

[CLS] Token is All You Need for Zero-Shot Semantic Segmentation

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

Semantic Softmax Loss for Zero-Shot Learning

Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation

Evolving Semantic Prototype Improves Generative Zero-Shot Learning

Exploiting Semantic Attributes for Transductive Zero-Shot Learning

TransZero: Attribute-guided Transformer for Zero-Shot Learning

Zero-Shot Learning via Discriminative Dual Semantic Auto-Encoder

Explanatory Object Part Aggregation for Zero-Shot Learning

Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning

Delving into Shape-aware Zero-shot Semantic Segmentation

TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning

SVDML: Semantic and Visual Space Deep Mutual Learning for Zero-Shot Learning

DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning