Abstract: Compositional Zero-Shot Learning (CZSL) is a Zero-Shot Learning (ZSL) task that uses known concepts (e.g., states and objects) to identify novel state-object compositions in image classification. Previous works have primarily focused on disentangling concept compositions or exploring the complex interactions between states and objects, while neglecting the fact that the inference of many states and compositions depends on different frequency components, which should be analyzed from a global perspective. We therefore propose a Spatial-frequency Feature Fusion Network (SFFNet), which introduces a new branch that uses a frequency-domain filtering encoder to adaptively enhance key frequency components and capture non-local interactions. We also find that the backbone widely used in conventional CZSL settings excels at perceiving local features, so we construct a fusion block that combines the strengths of both branches to capture local and non-local information. In addition, the traditional one-hot ground-truth distribution used during training does not reflect the actual relationships between compositions. We therefore propose a composition-relation-based label distribution regularization that encourages the model to learn the inner relationships between compositions, and we extend this method to construct a pseudo distribution over unseen compositions to further improve the model’s generalization to unseen compositions. Extensive experiments and detailed analysis on three popular datasets show that our method achieves state-of-the-art performance, demonstrating its advantage in identifying novel compositions. Code is available at https://github.com/lisuyi/SFFNet_czsl.
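
To make two ideas from the abstract concrete, the following is a minimal NumPy sketch, not the SFFNet implementation. The abstract does not specify either mechanism in detail, so everything here is an assumption made for illustration: the function names `frequency_filter` and `relation_soft_labels`, the per-channel complex filter `weight` (which would be learned in a real model), the rule that two compositions are "related" when they share a state or an object, and the `share_mass` value.

```python
import numpy as np


def frequency_filter(feat, weight):
    """Filter a (C, H, W) feature map in the frequency domain.

    `weight` is a hypothetical complex filter of shape (C, H, W // 2 + 1).
    Each frequency bin mixes information from every spatial position,
    so an element-wise product here acts globally (non-locally).
    """
    spec = np.fft.rfft2(feat, axes=(-2, -1))   # spatial -> frequency
    spec = spec * weight                       # re-weight frequency components
    # Back to the spatial domain, restoring the original H x W size.
    return np.fft.irfft2(spec, s=feat.shape[-2:], axes=(-2, -1))


def relation_soft_labels(pairs, target_idx, share_mass=0.1):
    """Build a soft target distribution over (state, object) compositions.

    Assumed relation: a composition is related to the target if it shares
    the target's state or object. Related compositions split `share_mass`
    evenly; the ground-truth composition keeps the rest.
    """
    state, obj = pairs[target_idx]
    related = [
        i for i, (s, o) in enumerate(pairs)
        if i != target_idx and (s == state or o == obj)
    ]
    dist = np.zeros(len(pairs))
    if related:
        dist[related] = share_mass / len(related)
        dist[target_idx] = 1.0 - share_mass
    else:
        dist[target_idx] = 1.0  # falls back to the one-hot target
    return dist


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((2, 4, 6))
    weight = np.ones((2, 4, 6 // 2 + 1), dtype=complex)  # identity filter
    print(frequency_filter(feat, weight).shape)

    pairs = [("wet", "dog"), ("wet", "cat"), ("dry", "dog"), ("dry", "cat")]
    print(relation_soft_labels(pairs, target_idx=0))
```

With an all-ones `weight` the filter leaves the feature map unchanged, which is a convenient sanity check before any weights are trained. The soft labels replace the one-hot target in a cross-entropy loss; how SFFNet actually measures composition relations and builds the pseudo distribution for unseen compositions is described in the paper and its repository, not here.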

Learning to Embed Seen/Unseen Compositions based on Graph Networks

Learning Graph Embeddings for Open World Compositional Zero-Shot Learning

Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning

Hierarchical Prompt Learning for Compositional Zero-Shot Recognition

Fusing Spatial and Frequency Features for Compositional Zero-Shot Image Classification

Dual-Stream Contrastive Learning for Compositional Zero-Shot Recognition

Learning to Infer Unseen Single-/Multi-Attribute-Object Compositions with Graph Networks

Disentangling Before Composing: Learning Invariant Disentangled Features for Compositional Zero-Shot Learning

Compositional Zero-shot Learning Via Progressive Language-based Observations

Imaginary-Connected Embedding in Complex Space for Unseen Attribute-Object Discrimination

Continual Compositional Zero-Shot Learning

Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning

MRSP: Learn Multi-Representations of Single Primitive for Compositional Zero-Shot Learning

Learning Conditional Prompt for Compositional Zero-Shot Learning

Hybrid Discriminative Attribute-Object Embedding Network for Compositional Zero-Shot Learning

Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning

Learning Conditional Attributes for Compositional Zero-Shot Learning

Zero-Shot Compositional Concept Learning

Learning Invariant Visual Representations for Compositional Zero-Shot Learning

Cross-composition Feature Disentanglement for Compositional Zero-shot Learning