Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

Xudong Yan, Songhe Feng, Yang Zhang, Jian Yang, Yueguan Lin, Haojun Fei
2024-11-18
Abstract: Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attribute and object by extracting shared and exclusive parts between image pairs sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) the efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attribute with object in the same parts; (2) existing word embeddings fail to capture complex multimodal semantic information; (3) overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. Being aware of these, we propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and utilize learnable condition masks to capture multigranularity features for disentanglement. Then, the last hidden states of MLLM are employed as word embeddings for their superior representation capabilities. Moreover, we propose attribute smoothing with auxiliary attributes generated by Large Language Model (LLM) for seen compositions, addressing the issue of overconfidence by encouraging the model to learn more attributes in one given composition. Extensive experiments demonstrate that TRIDENT achieves state-of-the-art performance on three benchmarks.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to improve a model's ability to recognize unseen attribute-object combinations in Compositional Zero-Shot Learning (CZSL). It identifies three main problems with current methods:

1. **Difficulties in disentanglement**: Existing disentanglement methods are weakened by the background and by the complex entanglement of attributes and objects within the same image regions. The model tends to extract background features as exclusive features, and disentangling at the spatial level is hard because the spatial features of attributes and objects often overlap.
2. **Insufficient word embeddings**: Existing word embeddings (such as Word2Vec and GloVe) cannot capture complex multimodal semantic information. They are based mainly on word frequency and contextual co-occurrence, so they miss high-level semantic nuances. They are also trained on text alone and therefore cannot capture cross-modal information between images and text.
3. **Overconfidence**: Existing models are overconfident on seen combinations, which hinders generalization to new combinations. Because training uses one-hot labels, the model learns only one attribute and one object per image and ignores the multiple attributes an object naturally has. It therefore treats every other attribute that could describe the object as a negative, which lowers performance on unseen combinations.

To overcome these problems, the paper proposes a new framework called TRIDENT (Multimodal Large Language Model embeddings and attribute smoothing guided disentanglement), which consists of three modules: visual feature extraction, attribute-object disentanglement, and feature alignment. With these modules, TRIDENT aims to reduce the influence of the background, use multi-granularity features for disentanglement, obtain stronger word embeddings from a multimodal large language model (MLLM), and reduce overconfidence through attribute smoothing, thereby improving recognition of unseen combinations.
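The contrast between one-hot labels and attribute smoothing can be illustrated with a small sketch. This is not the paper's implementation: the summary above does not give the exact formulation, so the function names, the smoothing weight `alpha`, and the choice to split the smoothed mass equally among the auxiliary attributes are all assumptions made for illustration.

```python
import math


def smoothed_attribute_target(num_attributes, true_idx, auxiliary_idxs, alpha=0.1):
    """Build a soft target over attributes (hypothetical helper).

    The labeled attribute keeps 1 - alpha of the probability mass and the
    auxiliary attributes (e.g. ones suggested by an LLM) share alpha equally.
    With no auxiliary attributes this falls back to a one-hot target.
    """
    aux = [i for i in set(auxiliary_idxs) if i != true_idx]
    target = [0.0] * num_attributes
    if not aux:
        target[true_idx] = 1.0
        return target
    target[true_idx] = 1.0 - alpha
    share = alpha / len(aux)
    for i in aux:
        target[i] = share
    return target


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


# Five attributes; index 0 is the labeled attribute and indices 1 and 3 are
# assumed auxiliary attributes for the same composition.
target = smoothed_attribute_target(5, true_idx=0, auxiliary_idxs=[1, 3], alpha=0.2)
loss = soft_cross_entropy([2.0, 1.0, -1.0, 0.5, -2.0], target)
```

Under this assumed scheme, auxiliary attributes receive a small positive target instead of being treated as pure negatives, which is the behavior the summary attributes to attribute smoothing.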