MAC: A Benchmark for Multiple Attributes Compositional Zero-Shot Learning

Shuo Xu, Sai Wang, Xinyue Hu, Yutian Lin, Bo Du, Yu Wu
2024-06-19
Abstract: Compositional Zero-Shot Learning (CZSL) aims to learn semantic primitives (attributes and objects) from seen compositions and recognize unseen attribute-object compositions. Existing CZSL datasets focus on single attributes, neglecting the fact that real-world objects naturally exhibit multiple interrelated attributes. Their narrow attribute scope and single-attribute labeling introduce annotation biases, undermining model performance and evaluation. To address these limitations, we introduce the Multi-Attribute Composition (MAC) dataset, encompassing 18,217 images and 11,067 compositions with comprehensive, representative, and diverse attribute annotations. MAC includes an average of 30.2 attributes per object and 65.4 objects per attribute, facilitating better multi-attribute composition predictions. Our dataset supports deeper semantic understanding and higher-order attribute associations, providing a more realistic and challenging benchmark for the CZSL task. We also develop solutions for multi-attribute compositional learning and propose the MM-encoder to disentangle the attributes and objects.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is the inability of existing Compositional Zero-Shot Learning (CZSL) datasets to handle multi-attribute compositions. Specifically, existing CZSL datasets focus mainly on single attributes, ignoring the fact that real-world objects usually have multiple interrelated attributes. This leads to annotation bias and weakens both model performance and the accuracy of evaluation.

To address these challenges, the authors introduce a new dataset, the Multi-Attribute Composition (MAC) dataset. It contains 18,217 images and 11,067 compositions, each describing an object and its multiple attributes. The characteristics of the MAC dataset include:

1. **Comprehensive and diverse attribute annotation**: Each object has an average of 30.2 attributes, and each attribute covers an average of 65.4 objects.
2. **Deeper semantic understanding**: Supports higher-order attribute associations, providing a more realistic and challenging benchmark for CZSL tasks.
3. **Addressing the limitations of existing datasets**: Advances CZSL by introducing attributes along more dimensions (such as state and nature), not just surface attributes such as color and material.

In addition, the authors develop a method for multi-attribute compositional learning and propose the MM-encoder to decouple attributes and objects, thereby achieving state-of-the-art performance.

### Main contributions of the paper

1. **Constructing the MAC dataset**:
   - It contains 18,217 images and 11,067 compositions, with rich, representative, and diverse attribute annotations.
   - Each composition describes an object and its multiple attributes, supporting more comprehensive multi-attribute composition prediction.
2. **Proposing the MM-encoder model**:
   - Uses two-branch prompt tuning to decouple attributes and objects.
   - Uses multi-modal adaptation to model the relationships between different semantic primitives (attributes and objects) and between images and text.
3. **Improving the evaluation of CZSL tasks**:
   - Introduces the multi-label single-attribute composition classification task to better evaluate model performance in closed-world and open-world settings.
   - Proposes a normalized coverage metric to avoid the excessively large values that traditional coverage metrics produce in composition classification.

### Summary

This paper addresses the limitations of existing CZSL datasets in handling multi-attribute compositions and advances the field of compositional zero-shot learning by introducing the MAC dataset and the MM-encoder model. Through more comprehensive and diverse attribute annotations and deeper semantic understanding, the MAC dataset provides a more realistic and challenging benchmark for CZSL tasks.
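
The two averages reported for MAC (attributes per object and objects per attribute) can be illustrated with a minimal sketch. It assumes each annotation is an `(object, attributes)` pair and that the averages count distinct co-occurring labels; the paper may define them differently. The toy annotations and the function name `attribute_object_stats` are hypothetical and are not taken from the MAC dataset or its tooling.

```python
from collections import defaultdict

# Hypothetical toy annotations: (object, set of attributes) per image.
# Illustrative only; these are not MAC data.
annotations = [
    ("apple", {"red", "ripe", "round"}),
    ("apple", {"green", "round"}),
    ("car", {"red", "metallic"}),
    ("ball", {"round", "red"}),
]


def attribute_object_stats(annotations):
    """Return (avg distinct attributes per object, avg distinct objects per attribute)."""
    attrs_per_object = defaultdict(set)
    objects_per_attr = defaultdict(set)
    for obj, attrs in annotations:
        for attr in attrs:
            attrs_per_object[obj].add(attr)
            objects_per_attr[attr].add(obj)
    # Assumes at least one annotation; an empty list would divide by zero.
    avg_attrs = sum(len(a) for a in attrs_per_object.values()) / len(attrs_per_object)
    avg_objs = sum(len(o) for o in objects_per_attr.values()) / len(objects_per_attr)
    return avg_attrs, avg_objs


avg_attrs, avg_objs = attribute_object_stats(annotations)
print(f"attributes per object: {avg_attrs:.2f}")
print(f"objects per attribute: {avg_objs:.2f}")
```

Under this counting, both averages are computed over the same set of object-attribute pairs, so a dataset with many objects per attribute and many attributes per object has a densely connected label space. That is the property the summary above credits to MAC.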