Abstract: Multi-label image classification is a fundamental yet challenging task that aims to predict the labels associated with a given image. Most previous methods directly use the high-level features from the last layer of a convolutional neural network for classification. However, these methods cannot obtain global features because of the limited size of convolutional kernels, and they fail to extract the multi-scale features needed to recognize small-scale objects. Recent studies exploit graph convolutional networks to model label correlations and boost classification performance. Despite substantial progress, these methods rely on manually pre-defined graph structures. Moreover, they ignore the associations between semantic labels and image regions and do not fully explore the spatial context of images. To address these issues, we propose a novel Dual Attention Transformer (DATran) model, which adopts a dual-stream architecture that simultaneously learns spatial and channel correlations from multi-label images. First, because current methods struggle to recognize small objects, we develop a new multi-scale feature fusion (MSFF) module that generates a multi-scale feature representation by jointly integrating high-level semantics and low-level details. Second, we design a prior-enhanced spatial attention (PSA) module that learns long-range correlations between objects at different spatial positions in an image, thereby enhancing model performance. Third, we devise a prior-enhanced channel attention (PCA) module that captures the inter-dependencies between different channel maps, thereby strengthening the correlations between semantic categories. Notably, the PSA and PCA modules complement each other to further augment the feature representations. Finally, the outputs of the two attention modules are fused to obtain the final features for classification. Experiments on the MS-COCO 2014, PASCAL VOC 2007, and VG-500 datasets demonstrate that DATran achieves better performance than current state-of-the-art models.
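
The dual-stream idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the DATran implementation: the abstract does not specify how the priors in PSA and PCA are computed, how MSFF combines feature maps, or how the two streams are fused. The sketch therefore assumes plain dot-product self-attention without learned projections or priors, nearest-neighbor upsampling with channel concatenation for multi-scale fusion, and element-wise summation for the final fusion. All function names (`fuse_multi_scale`, `spatial_attention`, `channel_attention`, `predict_label_probs`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_multi_scale(low, high):
    # Assumed fusion: upsample the coarse, semantic map to the fine
    # resolution (nearest neighbor) and concatenate along channels.
    # low: (C1, H, W); high: (C2, H/s, W/s) with integer s.
    s = low.shape[1] // high.shape[1]
    up = high.repeat(s, axis=1).repeat(s, axis=2)
    return np.concatenate([low, up], axis=0)

def spatial_attention(feat):
    # Self-attention over the N = H*W spatial positions, so every
    # position can aggregate information from all other positions.
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T                  # (N, C)
    attn = softmax(x @ x.T / np.sqrt(c), axis=-1) # (N, N)
    out = x + attn @ x                            # residual connection
    return out.T.reshape(c, h, w)

def channel_attention(feat):
    # Self-attention over the C channel maps, modeling
    # inter-dependencies between channels.
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)                        # (C, N)
    attn = softmax(x @ x.T / np.sqrt(h * w), axis=-1) # (C, C)
    out = x + attn @ x                                # residual connection
    return out.reshape(c, h, w)

def predict_label_probs(feat, w_cls):
    # Assumed fusion by summation, then global average pooling and a
    # linear classifier with independent sigmoids (one per label).
    fused = spatial_attention(feat) + channel_attention(feat)
    pooled = fused.mean(axis=(1, 2))              # (C,)
    logits = pooled @ w_cls                       # (num_labels,)
    return 1.0 / (1.0 + np.exp(-logits))

# Toy usage with random stand-ins for backbone features.
rng = np.random.default_rng(0)
low = rng.standard_normal((4, 8, 8))    # low-level, high-resolution map
high = rng.standard_normal((6, 4, 4))   # high-level, low-resolution map
feat = fuse_multi_scale(low, high)      # shape (10, 8, 8)
w_cls = rng.standard_normal((10, 5))    # 5 hypothetical labels
probs = predict_label_probs(feat, w_cls)
print(probs.shape)
```

Independent sigmoids are used rather than a softmax because several labels can be present in the same image, which is what distinguishes the multi-label setting from single-label classification.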

DATran: Dual Attention Transformer for Multi-Label Image Classification
