ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

Wenjin Hou,Dingjie Fu,Kun Li,Shiming Chen,Hehe Fan,Yi Yang

2024-08-27

Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. Due to the limited receptive fields of CNNs and the quadratic complexity of ViTs, however, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL. Our ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework. As a result, the learned semantic representations are better suited for classification. Through extensive experiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior performance, significantly outperforming the state-of-the-art (i.e., CNN-based and ViT-based) methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings. Code is available at: https://anonymous.4open.science/r/ZeroMamba.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a central problem in zero-shot learning (ZSL): recognizing unseen classes by transferring semantic knowledge from seen classes. Existing methods extract global visual features with Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interaction. These backbones perform well, but the limited receptive field of CNNs and the quadratic complexity of ViTs leave the visual-semantic interaction suboptimal. The paper proposes ZeroMamba, a parameter-efficient framework built on the visual state space model (Vision Mamba). ZeroMamba has three components: Semantic-aware Local Projection (SLP), which uses semantic embeddings to map visual features to local semantic-related representations; Global Representation Learning (GRL), which encourages the model to learn global semantic representations; and Semantic Fusion (SeF), which combines the two representations to make the semantic features more discriminative. In experiments on four ZSL benchmarks, the authors report that ZeroMamba outperforms CNN-based and ViT-based state-of-the-art methods under both conventional (CZSL) and generalized (GZSL) settings.
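To make the idea of fusing two semantic representations and classifying an unseen class more concrete, here is a minimal toy sketch. It is not the authors' implementation: the abstract does not specify how SeF combines the local and global representations, so the weighted sum (`alpha`), the cosine-similarity classifier, the function names, and the attribute vectors are all assumptions chosen for illustration.

```python
import math

def fuse(local_sem, global_sem, alpha=0.5):
    """Weighted sum of two semantic predictions (assumed fusion rule, not from the paper)."""
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local_sem, global_sem)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict(semantic, class_attributes):
    """Return the class whose attribute vector is most similar to `semantic`."""
    return max(class_attributes, key=lambda name: cosine(semantic, class_attributes[name]))

# Hypothetical attribute vectors for two unseen classes: [striped, four-legged, can fly].
unseen_classes = {
    "zebra": [1.0, 1.0, 0.0],
    "eagle": [0.0, 0.0, 1.0],
}

# Hypothetical outputs of a local branch and a global branch for one image.
local_sem = [0.9, 0.6, 0.1]
global_sem = [0.7, 0.8, 0.0]

fused = fuse(local_sem, global_sem)
label = predict(fused, unseen_classes)
print(label)
```

The point of the sketch is only that classification happens in the semantic (attribute) space, so a class with no training images can still be predicted as long as its attribute vector is known.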

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

OntoZSL: Ontology-enhanced Zero-shot Learning

Multi-modal Generative Adversarial Network for Zero-Shot Learning

Visual and Semantic Prototypes-Jointly Guided CNN for Generalized Zero-shot Learning

Semantic Graph-enhanced Visual Network for Zero-shot Learning

Learning Discriminative Projection with Visual Semantic Alignment for Generalized Zero Shot Learning

Visual-Semantic Aligned Bidirectional Network for Zero-Shot Learning

TransZero: Attribute-guided Transformer for Zero-Shot Learning

Zero-Shot Learning Via Robust Latent Representation and Manifold Regularization

SVDML: Semantic and Visual Space Deep Mutual Learning for Zero-Shot Learning

Zero-Shot Learning via Discriminative Dual Semantic Auto-Encoder

Joint Visual and Semantic Optimization for Zero-Shot Learning

Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning

HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning

Visual–Semantic Graph Matching Net for Zero-Shot Learning

Zero-shot Recognition with Latent Visual Attributes Learning

Learning discriminative visual semantic embedding for zero-shot recognition

Zero-Shot Learning Via Latent Space Encoding

Asymmetric Graph Based Zero Shot Learning