Graph Attention Transformer Network for Multi-Label Image Classification

Jin Yuan, Shikai Chen, Yao Zhang, Zhongchao Shi, Xin Geng, Jianping Fan, Yong Rui
2024-01-15
Abstract: Multi-label classification aims to recognize multiple objects or attributes from images. However, it is challenging to learn from proper label graphs to effectively characterize such inter-label correlations or dependencies. Current methods often use the co-occurrence probability of labels based on the training set as the adjacency matrix to model this correlation, which is greatly limited by the dataset and affects the model's generalization ability. In this paper, we propose a Graph Attention Transformer Network (GATN), a general framework for multi-label image classification that can effectively mine complex inter-label relationships. First, we use the cosine similarity based on the label word embedding as the initial correlation matrix, which can represent rich semantic information. Subsequently, we design the graph attention transformer layer to transfer this adjacency matrix to adapt to the current domain. Our extensive experiments have demonstrated that our proposed methods can achieve state-of-the-art performance on three datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses key challenges in multi-label image classification, namely how to effectively learn and use the relationships and dependencies between labels:

1. **Learning Label Relationships**: Multi-label image classification tasks require identifying multiple objects or attributes in an image. However, existing methods have limitations in learning the associations between labels. Traditional methods typically construct an adjacency matrix based on the co-occurrence probability of labels in the training set, which is limited by the dataset and affects the model's generalization ability.
2. **Modeling Label Relationships**: To overcome the above limitations, the paper proposes a new framework, the Graph Attention Transformer Network (GATN). This framework explores complex inter-label relationships through the following steps:
   - **Generation of Initial Correlation Matrix**: Using cosine similarity based on label word embeddings as the initial correlation matrix to represent rich semantic information.
   - **Design of Graph Attention Transformer Layer**: Designing a graph attention transformer layer to convert the initial correlation matrix into an adjacency matrix adapted to the current domain.
3. **Performance Improvement**: Through extensive experimental validation, the proposed GATN method achieves highly competitive performance on three datasets, demonstrating its effectiveness in multi-label image classification tasks.

### Main Contributions

- **Proposed a Novel End-to-End GATN Framework**: By obtaining useful information from node representations through a self-attention branch, it more accurately identifies meta-paths of the graph.
- **Initialization of Node Correlation Matrix**: Using label node embeddings to initialize the node correlation matrix in the graph, which has richer semantic information compared to traditional co-occurrence probability methods.
- **Experimental Validation**: Experimental results on multiple datasets show that the proposed method has significant advantages in performance compared to existing methods.

### Method Overview

1. **Generation of Correlation Matrix**: Generating the initial correlation matrix based on the cosine similarity between label (node) embeddings, then filtering and adjusting its values through binarization and reweighting strategies.
2. **Graph Attention Transformer Layer**: Designing a graph attention transformer layer that converts the generated correlation matrix into a new graph structure through a self-attention mechanism, exploring new multi-hop paths.
3. **Graph Attention Transformer Network**: Combining the transformed adjacency matrix and node embeddings, using a graph convolutional network to learn useful representations of label nodes.

### Experimental Results

- **VOC2007 Dataset**: GATN outperforms various existing methods on the mAP metric, with performance improvements exceeding 3% in certain categories (e.g., bottle, chair, and sofa).
- **MS-COCO Dataset**: GATN performs excellently on almost all metrics, especially outperforming other methods by 12.2%, 13.1%, and 9.9% on mAP, CF1, and OF1 metrics, respectively.
- **NUS-WIDE Dataset**: GATN achieves the best performance on mAP, CF1, and other metrics, and the second-best performance on CP, CR, OP, and OF1 metrics.

### Conclusion

By proposing the GATN framework, the paper effectively addresses the challenge of learning label relationships in multi-label image classification, demonstrating its superior performance on different datasets.
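
The correlation-matrix step from the Method Overview (cosine similarity between label embeddings, followed by binarization and reweighting) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy 2-D embeddings, the threshold `tau`, the self-weight parameter `p`, and the function names are all assumptions made for illustration, and the reweighting rule shown (weight `1 - p` on the node itself, `p` shared equally among its neighbours) is one scheme used in earlier GCN-based multi-label work that may differ from the paper's exact formula.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows (one row per label)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def binarize(corr: np.ndarray, tau: float) -> np.ndarray:
    """Keep an edge only where the similarity reaches the threshold tau (assumed)."""
    return (corr >= tau).astype(np.float64)


def reweight(binary: np.ndarray, p: float) -> np.ndarray:
    """Assumed scheme: weight 1 - p on the node itself, p split across neighbours."""
    eye = np.eye(binary.shape[0])
    off_diag = binary * (1.0 - eye)
    row_sums = off_diag.sum(axis=1, keepdims=True)
    # Rows without neighbours keep zeros off the diagonal.
    scaled = np.divide(
        p * off_diag, row_sums, out=np.zeros_like(off_diag), where=row_sums > 0
    )
    return scaled + (1.0 - p) * eye


# Hypothetical 2-D "word embeddings" for three labels: cat, dog, car.
label_embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])

corr = cosine_similarity_matrix(label_embeddings)
adj = reweight(binarize(corr, tau=0.5), p=0.2)
print(np.round(adj, 3))
```

In this toy setup, "cat" and "car" have zero similarity, so no edge survives between them, while "dog" is linked to both. Each row of the resulting matrix sums to 1, so it can be used directly as a normalized adjacency matrix for later graph layers.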