Abstract: Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is typically represented by attribute descriptions shared between different classes, which act as strong priors for localizing object attributes that represent discriminative region features, enabling significant and sufficient visual-semantic interaction for advancing ZSL. Existing attention-based models learn inferior region features from a single image because they rely solely on unidirectional attention, which ignores the transferable and discriminative attribute localization of visual features needed to represent the key semantic knowledge for effective knowledge transfer in ZSL. In this paper, we propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for key semantic knowledge representations in ZSL. Specifically, TransZero++ employs an attribute → visual Transformer sub-net (AVT) and a visual → attribute Transformer sub-net (VAT) to learn attribute-based visual features and visual-based attribute features, respectively. By further introducing feature-level and prediction-level semantical collaborative losses, the two attribute-guided Transformers teach each other to learn semantic-augmented visual embeddings for key semantic knowledge representations via semantical collaborative learning. Finally, the semantic-augmented visual embeddings learned by AVT and VAT are fused to perform visual-semantic interaction with the class semantic vectors for ZSL classification. Extensive experiments show that TransZero++ achieves new state-of-the-art results on three standard ZSL benchmarks and on the large-scale ImageNet dataset. The project website is available at: https://shiming-chen.github.io/TransZero-pp/TransZero-pp.html.
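The two attention directions described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes plain single-head dot-product attention with no learned projections, and the function name `cross_attention` and the toy `regions` and `attributes` vectors are hypothetical. It only shows how swapping the roles of query and key/value yields attribute-based visual features in one direction and visual-based attribute features in the other.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention (assumption: no
    learned projections). Each query returns a weighted average of the
    rows of keys_values, weighted by query-key similarity."""
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / scale
                  for kv in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                        for d in range(dim)])
    return outputs


# Hypothetical toy data: 3 image regions and 2 attributes in a shared 4-d space.
regions = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.5, 0.5, 0.0, 0.0]]
attributes = [[2.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0]]

# Attribute -> visual direction: each attribute queries the image regions,
# giving one attribute-based visual feature per attribute.
attr_based_visual = cross_attention(attributes, regions)

# Visual -> attribute direction: each region queries the attributes,
# giving one visual-based attribute feature per region.
visual_based_attr = cross_attention(regions, attributes)
```

In this sketch the first attribute puts most of its attention weight on the first region, so its output leans toward that region's feature. The collaborative losses and the fusion step mentioned in the abstract are not modeled here.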

Multi-scale Visual Attention for Attribute Disambiguation in Zero-Shot Learning

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Multiscale Visual-Attribute Co-Attention for Zero-Shot Image Recognition

Attribute Attention for Semantic Disambiguation in Zero-Shot Learning

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

A Multi-Group Multi-Stream attribute Attention network for fine-grained zero-shot learning

Visual-guided attentive attributes embedding for zero-shot learning

Deep Semantic-Visual Alignment for Zero-Shot Remote Sensing Image Scene Classification

Learning complementary semantic information for zero-shot recognition

Stacked Semantic-Guided Attention Model for Fine-Grained Zero-Shot Learning

Attentive Semantic Preservation Network for Zero-Shot Learning

PSVMA+: Exploring Multi-granularity Semantic-visual Adaption for Generalized Zero-shot Learning

Dual Relation Mining Network for Zero-Shot Learning

Multi-modal Generative Adversarial Network for Zero-Shot Learning

ZS-VAT: Learning Unbiased Attribute Knowledge for Zero-Shot Recognition Through Visual Attribute Transformer

High-Discriminative Attribute Feature Learning for Generalized Zero-Shot Learning

Adaptive multi-scale semantic fusion network for zero-shot learning

Adaptive Relation-Aware Network for zero-shot classification

Attribute self-representation steered by exclusive lasso for zero-shot learning

A Discriminative Cross-Aligned Variational Autoencoder for Zero-Shot Learning

TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning