Abstract: Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting shared and exclusive parts between image pairs sharing the same attribute (object), and by aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) the efficacy of disentanglement is compromised by the influence of the background and the intricate entanglement of attribute with object in the same parts; (2) existing word embeddings fail to capture complex multimodal semantic information; (3) the overconfidence that existing models exhibit on seen compositions hinders their generalization to novel compositions. To address these limitations, we propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of the background, and utilize learnable condition masks to capture multigranularity features for disentanglement. Then, the last hidden states of the MLLM are employed as word embeddings for their superior representation capabilities. Moreover, we propose attribute smoothing with auxiliary attributes generated by a Large Language Model (LLM) for seen compositions, addressing the issue of overconfidence by encouraging the model to learn more attributes in one given composition. Extensive experiments demonstrate that TRIDENT achieves state-of-the-art performance on three benchmarks.

What problem does this paper attempt to address?

This paper aims to improve a model's ability to recognize unseen attribute-object combinations in the Compositional Zero-Shot Learning (CZSL) task. Specifically, it identifies three main problems with current methods:

1. **Difficulties in disentanglement**: Existing disentanglement methods are weakened by the influence of the background and by the complex entanglement of attributes and objects in the same image parts. The model tends to extract background features as exclusive features, and disentangling at the spatial level is hard because the spatial features of attributes and objects often overlap.
2. **Insufficient word embeddings**: Existing word embedding methods (such as Word2Vec and GloVe) cannot capture complex multimodal semantic information. These embeddings rely mainly on word frequency and contextual co-occurrence, so they miss high-level semantic nuances. They are also trained on text alone and therefore cannot capture cross-modal information between images and text.
3. **Overconfidence**: Existing models are overconfident on seen combinations, which hinders their generalization to new combinations. Because training uses one-hot labels, a model learns only one attribute and one object per image and ignores the multiple attributes an object naturally has. It then treats every other attribute that could describe the object as a negative, which degrades performance on unseen combinations.

To overcome these problems, the paper proposes a new framework called TRIDENT (Multimodal Large Language Model embeddings and attribute smoothing guided disentanglement), which consists of three modules: visual feature extraction, attribute-object disentanglement, and feature alignment. Through these modules, TRIDENT aims to reduce the influence of the background, use multi-granularity features for disentanglement, obtain more expressive word embeddings from a multimodal large language model (MLLM), and reduce the model's overconfidence through attribute smoothing, thereby improving recognition of unseen combinations.
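
The contrast between one-hot labels and attribute smoothing can be illustrated with a minimal sketch. It is not the paper's exact formulation: the function names, the `alpha` smoothing weight, the even split of mass among auxiliary attributes, and the example attributes and logits are all assumptions made for illustration. In TRIDENT the auxiliary attributes come from an LLM; here they are hard-coded.

```python
import math


def smoothed_attribute_target(attributes, true_attr, auxiliary_attrs, alpha=0.1):
    """Build a soft target distribution over attributes (hypothetical helper).

    The labeled attribute keeps 1 - alpha of the probability mass and the
    remaining alpha is split evenly among the auxiliary attributes.
    With no usable auxiliary attributes this falls back to a one-hot target.
    """
    if true_attr not in attributes:
        raise ValueError(f"unknown attribute: {true_attr}")
    aux = [a for a in auxiliary_attrs if a in attributes and a != true_attr]
    target = {a: 0.0 for a in attributes}
    if not aux:
        target[true_attr] = 1.0
        return target
    target[true_attr] = 1.0 - alpha
    for a in aux:
        target[a] = alpha / len(aux)
    return target


def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target (both dicts)."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return -sum(p * (logits[a] - log_z) for a, p in target.items())


# Illustrative data: an image labeled "red" (e.g. a red apple) that could
# also plausibly be described as "ripe" or "fresh".
attributes = ["red", "ripe", "fresh", "sliced", "rotten"]
logits = {"red": 4.0, "ripe": 2.5, "fresh": 2.0, "sliced": -1.0, "rotten": -2.0}

one_hot = smoothed_attribute_target(attributes, "red", [])
smooth = smoothed_attribute_target(attributes, "red", ["ripe", "fresh"], alpha=0.2)

print("one-hot target:", one_hot)
print("smoothed target:", smooth)
print("loss (one-hot):", cross_entropy(logits, one_hot))
print("loss (smoothed):", cross_entropy(logits, smooth))
```

With the one-hot target, any probability the model assigns to "ripe" or "fresh" is penalized exactly like probability assigned to "rotten". The smoothed target instead gives the plausible auxiliary attributes a small share of the mass, which is the intuition behind reducing overconfidence on seen compositions.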

Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Disentangling Before Composing: Learning Invariant Disentangled Features for Compositional Zero-Shot Learning

Cross-composition Feature Disentanglement for Compositional Zero-shot Learning

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

MRSP: Learn Multi-Representations of Single Primitive for Compositional Zero-Shot Learning

Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning

Compositional Zero-shot Learning Via Progressive Language-based Observations

Learning Conditional Attributes for Compositional Zero-Shot Learning

Learning to Embed Seen/Unseen Compositions based on Graph Networks

Hybrid Discriminative Attribute-Object Embedding Network for Compositional Zero-Shot Learning

LVAR-CZSL: Learning Visual Attributes Representation for Compositional Zero-Shot Learning

Learning Attention as Disentangler for Compositional Zero-shot Learning

Compositional Zero-Shot Learning with Contextualized Cues and Adaptive Contrastive Training

Manifold Regularized Cross-Modal Embedding for Zero-Shot Learning

Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning

Continual Compositional Zero-Shot Learning

Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning

Zero-Shot Learning With Manifold Embedding

Retrieval-Augmented Primitive Representations for Compositional Zero-Shot Learning

Learning complementary semantic information for zero-shot recognition