G-CAM: Graph Convolution Network Based Class Activation Mapping for Multi-label Image Recognition.

Yangtao Wang, Yanzhao Xie, Yu Liu, Lisheng Fan
DOI: https://doi.org/10.1145/3460426.3463620
2021-01-01
Abstract: In most multi-label image recognition tasks, human visual perception remains consistent across different spatial transforms of the same image. Existing approaches either learn this perceptual consistency with only image-level supervision, or preserve the middle-level feature consistency of attention regions but neglect the global label dependencies between different objects over the dataset. To address this issue, we integrate a graph convolution network (GCN) and propose G-CAM, which learns visual attention consistency via GCN-based class activation mapping (CAM) for multi-label image recognition. G-CAM consists of an image feature extraction module, which generates the feature maps of the original image and of its transformed version, and a GCN module, which learns weighted classifiers that capture the label dependencies between different objects. Unlike previous works that use a fully connected classification layer, G-CAM first fuses the weighted classifiers with the feature vector to generate the predicted labels for each input image, then combines the weighted classifiers with the feature maps to obtain, respectively, the transformed attention heatmaps of the original image and the attention heatmaps of the transformed image. The attention consistency loss is computed from the distance between these two sets of attention heatmaps. Finally, this loss is combined with the multi-label classification loss to update the whole network in an end-to-end manner. We conduct extensive experiments on three multi-label image datasets: FLICKR25K, MS-COCO and NUS-WIDE. Experimental results demonstrate that G-CAM achieves better performance than state-of-the-art multi-label image recognition methods.
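The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The two-layer GCN with symmetric adjacency normalisation, the horizontal flip used as the spatial transform, the mean-squared distance between heatmaps, and the weight `lam` are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np

def gcn_classifiers(label_emb, adj, w1, w2):
    """Map label embeddings to one classifier per class with a two-layer GCN (assumed form)."""
    # Add self-loops and normalise symmetrically: D^-1/2 (A + I) D^-1/2.
    a = adj + np.eye(adj.shape[0])
    d = 1.0 / np.sqrt(a.sum(axis=1))
    a_hat = a * d[:, None] * d[None, :]
    hidden = np.maximum(a_hat @ label_emb @ w1, 0.0)  # ReLU
    return a_hat @ hidden @ w2  # shape: (num_classes, channels)

def heatmaps(classifiers, fmap):
    """Weight the channels of fmap (C, H, W) by each classifier: one heatmap per class."""
    return np.einsum("kc,chw->khw", classifiers, fmap)

def g_cam_loss(fmap, fmap_t, classifiers, targets, lam=1.0):
    """Return (classification loss, attention consistency loss, combined loss)."""
    # Fuse classifiers with the globally average-pooled feature vector to get label scores.
    logits = classifiers @ fmap.mean(axis=(1, 2))
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-7
    cls_loss = -np.mean(targets * np.log(probs + eps)
                        + (1.0 - targets) * np.log(1.0 - probs + eps))
    # Transformed heatmaps of the original image (flip assumed as the transform)
    # versus heatmaps of the transformed image.
    hm_orig_transformed = heatmaps(classifiers, fmap)[:, :, ::-1]
    hm_transformed = heatmaps(classifiers, fmap_t)
    att_loss = np.mean((hm_orig_transformed - hm_transformed) ** 2)
    return cls_loss, att_loss, cls_loss + lam * att_loss
```

In this sketch, a feature extractor that is perfectly equivariant to the flip gives an attention consistency loss of zero, so only the classification term drives the update. Any mismatch between the two sets of heatmaps is penalised in proportion to `lam`.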