Bi-Modal Learning with Channel-Wise Attention for Multi-Label Image Classification

Peng Li, Peng Chen, Yonghong Xie, Dezheng Zhang
DOI: https://doi.org/10.1109/access.2020.2964599
IF: 3.9
2020-01-01
IEEE Access
Abstract: Multi-label image classification is closer to real-world applications than single-label classification. The problem is difficult because the complex label space makes it hard to obtain label-level attention regions and to handle semantic relationships among labels. Common deep network-based methods use a CNN to extract features and treat the labels as a sequence or a graph, handling label correlations with RNNs or graph-theoretical algorithms. In this paper, we propose a novel CNN-RNN-based model, the bi-modal multi-label learning (BMML) framework. First, an improved channel-wise attention mechanism is presented to generate regional attention maps and connect them to the related labels. Then, based on the assumption that objects in a semantic scene always have high-level relevance across visual and textual corpora, we embed the labels through different pre-trained language models and determine the label sequence in a “semantic space” constructed from large-scale textual data, thereby handling the labels in their semantic context. In addition, a cross-modal feature aligning module is introduced in the BMML framework. Experimental results show that BMML achieves higher accuracy than mainstream multi-label classification methods on several benchmark data sets.
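For readers unfamiliar with the term, the sketch below illustrates generic channel-wise attention in plain Python: each channel of a feature map is pooled to a single value, passed through a sigmoid gate, and the channel is rescaled by the resulting weight. This is not the improved mechanism proposed in the paper, which the abstract does not specify. The function names, the single linear gate, and the toy values of gate_matrix and gate_bias are all assumptions made for illustration.

```python
import math

def global_average_pool(feature_map):
    """Reduce each channel (an H x W grid) to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

def channel_attention_weights(pooled, gate_matrix, gate_bias):
    """Map pooled descriptors to per-channel weights in (0, 1) via a sigmoid gate.

    gate_matrix and gate_bias are hypothetical parameters; in a trained
    network they would be learned rather than set by hand.
    """
    weights = []
    for row, bias in zip(gate_matrix, gate_bias):
        z = sum(w * p for w, p in zip(row, pooled)) + bias
        weights.append(1.0 / (1.0 + math.exp(-z)))
    return weights

def apply_channel_attention(feature_map, gate_matrix, gate_bias):
    """Rescale every channel of a C x H x W feature map by its attention weight."""
    pooled = global_average_pool(feature_map)
    weights = channel_attention_weights(pooled, gate_matrix, gate_bias)
    reweighted = [
        [[w * value for value in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]
    return reweighted, weights

# Toy input: two 2 x 2 channels, one active and one silent (made-up values).
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
gate_matrix = [[1.0, 0.0], [0.0, 1.0]]  # identity gate, chosen for readability
gate_bias = [0.0, 0.0]

reweighted, weights = apply_channel_attention(features, gate_matrix, gate_bias)
```

With these toy values the active channel receives a larger weight than the silent one, which is the basic effect any channel-wise attention scheme relies on to emphasize informative feature channels.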