Semantic-Guided Representation Enhancement for Multi-Label Image Classification
Xuelin Zhu, Jianshu Li, Jiuxin Cao, Dongqi Tang, Jian Liu, Bo Liu
DOI: https://doi.org/10.1109/tcsvt.2024.3408256
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Multi-label image classification is an essential yet challenging task that requires recognizing multiple objects in an image. To this end, recent studies use attention models to obtain a visual representation for each label and then train binary classifiers for prediction. However, these methods have two major drawbacks: 1) they rely heavily on precise alignment between the two modalities (images and labels), which remains challenging for current attention models; 2) they ignore patch-level representations, which are rich in local object features and are also of great importance for label recognition. In this paper, we propose a semantic-guided representation enhancement framework that augments patch-level representations with object-level representations for robust label recognition. Concretely, the proposed framework consists of two key components: 1) an inter-modal attention module that coarsely locates object regions and produces an object-level representation for each label; 2) an intra-modal attention module that aggregates object representations to enhance patch representations based on their correlations. In this way, both the local clues and the global view of objects are exploited simultaneously, rather than relying solely on the object-level representations obtained by inter-modal attention, which improves label recognition performance. Experimental results show that our framework outperforms state-of-the-art methods by 0.5%, 0.6%, 0.7% and 0.8% in mAP on the Pascal VOC 2007, Microsoft COCO, NUS-WIDE and Visual Genome datasets, respectively. Code and models are available at https://github.com/jasonseu/SGRE.
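The two attention modules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that): the scaled dot-product form of both attentions, the residual sum used to enhance patches, the max-pooled sigmoid head, and all function names and tensor sizes are assumptions chosen only to show how data might flow from patch features and label embeddings to per-label scores.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def inter_modal_attention(label_emb, patches):
    """Labels attend over patches, giving one object-level vector per label.

    Assumption: plain scaled dot-product attention with no learned projections.
    """
    d = patches.shape[1]
    attn = softmax(label_emb @ patches.T / np.sqrt(d), axis=-1)  # (L, P)
    return attn @ patches, attn  # (L, d), (L, P)


def intra_modal_attention(patches, objects):
    """Patches attend over object vectors and are enhanced by a residual sum.

    Assumption: the enhancement is additive; the paper may combine them differently.
    """
    d = patches.shape[1]
    attn = softmax(patches @ objects.T / np.sqrt(d), axis=-1)  # (P, L)
    return patches + attn @ objects  # (P, d)


def label_probabilities(enhanced, label_emb):
    """Hypothetical head: max-pool patch/label similarity, then a sigmoid."""
    d = enhanced.shape[1]
    logits = (enhanced @ label_emb.T / np.sqrt(d)).max(axis=0)  # (L,)
    return 1.0 / (1.0 + np.exp(-logits))


# Illustrative sizes only: a 7x7 patch grid, 20 labels, 64-dim features.
rng = np.random.default_rng(0)
num_patches, num_labels, dim = 49, 20, 64
patches = rng.standard_normal((num_patches, dim))
label_emb = rng.standard_normal((num_labels, dim))

objects, label_to_patch = inter_modal_attention(label_emb, patches)
enhanced = intra_modal_attention(patches, objects)
probs = label_probabilities(enhanced, label_emb)
print(objects.shape, enhanced.shape, probs.shape)
```

Each row of `label_to_patch` is a distribution over patches, which is the sense in which the first module "coarsely locates" a label's object region; the second module then feeds those object-level vectors back into every patch so that the final scores depend on both local and object-level features. In a trained model the random arrays would be replaced by backbone features and learned label embeddings.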