Abstract: This paper investigates the challenging problem of multi-label zero-shot learning (MLZSL), in which a model is trained to recognize multiple unseen classes within a sample (e.g., an image) using seen classes and auxiliary knowledge such as semantic information. Existing methods usually analyze the relationships among the seen classes present in a sample along spatial or semantic dimensions and then transfer the learned model to unseen classes. However, they neglect the integrity of local and global features. Attention structures can accurately locate local features, especially objects, but they significantly compromise the integrity of those features and distort the relationships between classes. Coarse processing of global features likewise reduces the comprehensiveness of the representation. As a result, the model loses its grasp of the main components of the image, and relying only on the local presence of seen classes at inference introduces unavoidable bias. In this paper, we propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, that makes full use of local and global features to enable a more accurate and robust visual-semantic projection. For spatial information, we achieve effective refinement by group-aggregating image features into several semantic prompts. This aggregates semantic information rather than class information, preserving the correlation between semantics. For global semantics, we use global forward propagation to collect as much information as possible so that no semantics are omitted. Experiments on the large-scale MLZSL benchmark datasets NUS-Wide and Open-Images-v4 demonstrate that Epsilon outperforms other state-of-the-art methods by large margins.
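
The idea of combining group-aggregated local features with a global feature before projecting into semantic space can be illustrated with a minimal sketch. This is not the Epsilon implementation: the function names, the soft-assignment pooling, the use of mean pooling as a stand-in for the global branch, the linear projection `proj`, and the max-over-tokens scoring are all assumptions made only for illustration.

```python
import numpy as np

def group_aggregate(local_feats, prompts):
    """Pool N local features into G group features (assumed soft assignment).

    local_feats: (N, d) region or patch features.
    prompts:     (G, d) hypothetical learnable semantic prompts.
    Returns:     (G, d), each row a weighted average of local features.
    """
    logits = local_feats @ prompts.T                      # (N, G)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)   # softmax over groups
    # Normalise per group so each group feature is a weighted average.
    weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ local_feats                        # (G, d)

def score_classes(local_feats, prompts, proj, class_embeds):
    """Score every class (seen or unseen) from local and global evidence.

    proj:         (d, s) assumed linear visual-to-semantic projection.
    class_embeds: (C, s) class word embeddings (auxiliary knowledge).
    Returns:      (C,) one score per class.
    """
    grouped = group_aggregate(local_feats, prompts)            # (G, d)
    # Assumption: mean pooling stands in for the global branch.
    global_feat = local_feats.mean(axis=0, keepdims=True)      # (1, d)
    visual = np.concatenate([grouped, global_feat], axis=0)    # (G + 1, d)
    semantic = visual @ proj                                   # (G + 1, s)
    sims = semantic @ class_embeds.T                           # (G + 1, C)
    return sims.max(axis=0)                                    # best token per class

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(49, 16))      # e.g. a 7x7 grid of 16-d features
    prompts = rng.normal(size=(4, 16))
    proj = rng.normal(size=(16, 8))
    unseen = rng.normal(size=(5, 8))       # embeddings of 5 unseen classes
    print(score_classes(feats, prompts, proj, unseen).shape)
```

Because classes are represented only by their embeddings, the same scoring function applies to unseen classes at inference by swapping in their embeddings; multi-label prediction then amounts to thresholding or ranking the returned scores.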

Language-Augmented Pixel Embedding for Generalized Zero-Shot Learning

Generating Manifold-Aligned Semantic Feature for Zero-Shot Learning

OntoZSL: Ontology-enhanced Zero-shot Learning

SEER-ZSL: Semantic Encoder-Enhanced Representations for Generalized Zero-Shot Learning

Learn More from Less: Generalized Zero-Shot Learning with Severely Limited Labeled Data

Disentangled Ontology Embedding for Zero-shot Learning

Zero-Shot Embedding via Regularization-Based Recollection and Residual Familiarity Processes

Semantics Disentangling for Generalized Zero-Shot Learning

Towards Zero-Shot Learning: A Brief Review and an Attention-Based Embedding Network

Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning

Generalized Zero-Shot Learning Via Multi-Modal Aggregated Posterior Aligning Neural Network

Learning MLatent Representations for Generalized Zero-Shot Learning

Learning complementary semantic information for zero-shot recognition

Semantic-guided Reinforced Region Embedding for Generalized Zero-Shot Learning

Learning a Deep Embedding Model for Zero-Shot Learning

Zero Shot Learning Via Low-rank Embedded Semantic AutoEncoder

Scalable Zero-Shot Learning Via Binary Visual-Semantic Embeddings

CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation

Simple Is Better: A Global Semantic Consistency Based End-To-End Framework For Effective Zero-Shot Learning

Semantic Autoencoder for Zero-Shot Learning

Zero-shot learning via a specific rank-controlled semantic autoencoder