Distance Restricted Transformer Encoder for Multi-Label Classification

Xiaomei Wang, Yaqian Li, Tong Luo, Yandong Guo, Yanwei Fu, Xiangyang Xue
DOI: https://doi.org/10.1109/icme51207.2021.9428164
2021-01-01
Abstract: Multi-label image classification is a fundamental but challenging task in the multimedia community. It aims to predict the set of labels present in an image. Great progress has recently been made by training convolutional neural networks with a binary cross-entropy loss. However, conventional approaches are limited in their ability to highlight the key visual content associated with the target labels, and they pay little attention to constraining the distances between visual representations and positive/negative label representations. To address these issues, we first introduce a variant of the transformer encoder for acquiring the underlying and crucial visual information related to the ground-truth labels. Specifically, a novel primal feature guided net is designed to maintain the original visual features during the encoding process. Second, we exploit a distance restricted learning strategy in a common semantic space that shrinks the distances between images and their positive labels while expanding the distances to the negative ones during training. Extensive experiments are conducted on the MSCOCO and WIDER Attribute datasets, and outstanding performance is achieved compared with other state-of-the-art models.
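The abstract does not give the form of the distance restricted objective, so the following is only a minimal sketch of the general idea (pull an image embedding toward its positive label embeddings, push it away from negative ones). The function name `distance_restricted_loss`, the use of Euclidean distance, and the hinge `margin` are all assumptions made for illustration, not the authors' formulation.

```python
import math

def distance_restricted_loss(image_emb, label_embs, targets, margin=1.0):
    """Hypothetical margin-based loss in a shared embedding space.

    image_emb  : list of floats, the image representation
    label_embs : list of label representations (same dimension as image_emb)
    targets    : list of 0/1 flags, 1 if the label is present in the image
    margin     : assumed hyperparameter; negatives farther than this cost nothing
    """
    total = 0.0
    for label_emb, target in zip(label_embs, targets):
        dist = math.dist(image_emb, label_emb)  # Euclidean distance
        if target == 1:
            # Positive label: any distance from the image is penalised.
            total += dist
        else:
            # Negative label: penalised only when closer than the margin.
            total += max(0.0, margin - dist)
    return total / len(targets)

# Toy example: three labels in a 2-D space, image at the origin.
image_emb = [0.0, 0.0]
label_embs = [[0.0, 0.0], [3.0, 4.0], [0.6, 0.8]]

good = distance_restricted_loss(image_emb, label_embs, [1, 0, 0], margin=2.0)
bad = distance_restricted_loss(image_emb, label_embs, [0, 1, 0], margin=2.0)
```

In this toy setup, `good` is lower than `bad` because the label marked positive in the first call coincides with the image embedding, whereas in the second call the positive label is the farthest one and a negative label sits on top of the image.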