Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. However, because of the limited receptive fields of CNNs and the quadratic complexity of ViTs, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba. ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework whose learned semantic representations are better suited for classification. In extensive experiments on four prominent ZSL benchmarks, ZeroMamba significantly outperforms state-of-the-art CNN-based and ViT-based methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings. Code is available at: https://anonymous.4open.science/r/ZeroMamba.
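
The abstract does not specify how SeF combines the local (SLP) and global (GRL) semantic representations, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the two representations are attribute-space vectors of equal length, fuses them with a hypothetical convex combination (the `alpha` weight is an illustrative parameter), and classifies by cosine similarity against per-class attribute vectors. The class names and attribute values are made up for illustration. The only difference between the CZSL and GZSL settings in this sketch is the candidate label set: unseen classes only, versus seen and unseen classes together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def fuse(local_sem, global_sem, alpha=0.5):
    """Hypothetical fusion: convex combination of the two predictions.
    This is an assumption; the abstract does not define SeF's mechanism."""
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local_sem, global_sem)]

def predict(sem, class_attributes, candidates):
    """Return the candidate class whose attribute vector best matches `sem`."""
    return max(candidates, key=lambda c: cosine(sem, class_attributes[c]))

# Illustrative attribute vectors: [striped, aquatic, four-legged].
class_attributes = {
    "zebra": [1.0, 0.0, 1.0],  # unseen class
    "whale": [0.0, 1.0, 0.0],  # unseen class
    "horse": [0.0, 0.0, 1.0],  # seen class
}
seen, unseen = ["horse"], ["zebra", "whale"]

# Stand-ins for the semantic vectors a trained model would predict.
local_sem = [0.9, 0.1, 0.6]   # e.g. from a local projection branch
global_sem = [0.7, 0.0, 0.9]  # e.g. from a global representation branch

fused = fuse(local_sem, global_sem)
czsl_label = predict(fused, class_attributes, unseen)         # CZSL: unseen only
gzsl_label = predict(fused, class_attributes, seen + unseen)  # GZSL: seen + unseen
print(czsl_label, gzsl_label)
```

In the GZSL setting the seen class `horse` competes with the unseen classes, which is why that setting is generally harder: predictions tend to be biased toward classes observed during training.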

What Remains of Visual Semantic Embeddings

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Generating Manifold-Aligned Semantic Feature for Zero-Shot Learning

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

OntoZSL: Ontology-enhanced Zero-shot Learning

Visual-Semantic Aligned Bidirectional Network for Zero-Shot Learning

Learning a Deep Embedding Model for Zero-Shot Learning

Multi-modal Generative Adversarial Network for Zero-Shot Learning

Meta-Transfer Networks for Zero-Shot Learning

Learning discriminative visual semantic embedding for zero-shot recognition

Semantic Softmax Loss for Zero-Shot Learning

VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning

Zero-Shot Learning With Manifold Embedding

Visually Aligned Word Embeddings for Improving Zero-shot Learning

Disentangled Ontology Embedding for Zero-shot Learning

Information Bottleneck Constrained Latent Bidirectional Embedding for Zero-Shot Learning

Learning complementary semantic information for zero-shot recognition

ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

Zero-Shot Embedding via Regularization-Based Recollection and Residual Familiarity Processes

Zero-Shot Learning With Attentive Region Embedding and Enhanced Semantics