Abstract: Pattern images are artificially designed images that are distinctive in their elements, styles, and arrangements. As the number of pattern images grows, pattern image retrieval is emerging as a promising technique with significant potential in commercial and industrial applications such as fashion and home decoration, where it helps users quickly identify preferred print patterns. The main purpose of multi-label pattern image retrieval is to effectively represent images and match them with their corresponding labels. Compared with conventional image retrieval, multi-label pattern image retrieval is more challenging because abstract print patterns carry richer semantic information and the relationships between multiple labels are complex. To tackle these challenges, we propose Tran-GCN, a model designed specifically for multi-label pattern image retrieval. The model is built on a Transformer-based autoregressive architecture that uses image information to guide the exploration of correlations between labels through the textual modality. Using this correlation information, we construct a graph convolutional network (GCN) to further strengthen the correlations between image and label representations. More specifically, Tran-GCN applies a cross-modal attention mechanism at each layer to aggregate visual features from the input image and updates label semantics through residual connections. The GCN module is updated based on the correlations between textual features, which are represented in a relationship matrix. Extensive experiments on two widely used public visual benchmarks, MS-COCO and NUS-WIDE, and on a multi-label pattern image dataset, Pattern 2, consistently demonstrate that Tran-GCN generalizes well and achieves superior performance in multi-label pattern image retrieval.
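
The two mechanisms named in the abstract, cross-modal attention with a residual update and a GCN step driven by a label relationship matrix, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes single-head attention without learned projections, a relationship matrix built from clipped, row-normalized cosine similarity between label features, and a single GCN layer with ReLU. All function names and toy values are hypothetical.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(labels, patches):
    """Label embeddings (queries) attend over image patch features (keys and values).

    Assumption: no learned Q/K/V projections; scaled dot-product scores only.
    """
    d = len(labels[0])
    patches_t = [list(col) for col in zip(*patches)]
    scores = matmul(labels, patches_t)                      # (num_labels x num_patches)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, patches)                     # (num_labels x d)
    # Residual connection: add the aggregated visual features to the label semantics.
    return [[l + a for l, a in zip(lr, ar)] for lr, ar in zip(labels, attended)]

def relation_matrix(labels):
    """Assumed relationship matrix: clipped cosine similarity, rows normalized to sum to 1."""
    norms = [math.sqrt(sum(v * v for v in row)) + 1e-12 for row in labels]
    sim = [[max(0.0, sum(x * y for x, y in zip(a, b)) / (na * nb))
            for b, nb in zip(labels, norms)]
           for a, na in zip(labels, norms)]
    return [[v / sum(row) for v in row] for row in sim]

def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(adj @ feats @ weight)."""
    out = matmul(matmul(adj, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy inputs (hypothetical): 3 label embeddings and 2 image patch features, dimension 2.
labels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.5, 0.1], [0.2, 0.9]]
identity = [[1.0, 0.0], [0.0, 1.0]]

updated = cross_attention(labels, patches)   # image-guided label features
adj = relation_matrix(updated)               # label-to-label correlations
refined = gcn_layer(adj, updated, identity)  # correlation-aware label features
```

In this sketch the relationship matrix is recomputed from the image-guided label features, so the graph step reflects correlations that depend on the input image. A trained model would use learned projection and weight matrices in place of the identity matrix used here.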

M3TR: Multi-modal Multi-label Recognition with Transformer

Transformer-based Dual Relation Graph for Multi-label Image Recognition

SST: Spatial and Semantic Transformers for Multi-Label Image Recognition

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Transformer-based Multi-Modal Learning for Multi Label Remote Sensing Image Classification

Feature Learning Network with Transformer for Multi-Label Image Classification

Disentangling 3D/4D Facial Affect Recognition with Faster Multi-View Transformer

Multi-input trademark element recognition with transformer

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

Asymmetric Vision Transformers for Multi-Label Classification

A Multi-label Image Recognition Algorithm Based on Spatial and Semantic Correlation Interaction

Multi-Level Multimodal Transformer Network for Multimodal Recipe Comprehension

Imvs: Integrating Multi-View Information on Multiple Scales for 3D Object Recognition

Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition

Flat Multi-modal Interaction Transformer for Named Entity Recognition

Transformer Driven Matching Selection Mechanism for Multi-Label Image Classification

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

Multimodal Transformer for Automatic 3D Annotation and Object Detection

Tran-GCN: Multi-label Pattern Image Retrieval via Transformer Driven Graph Convolutional Network

Multimodal Transformer With Multi-View Visual Representation for Image Captioning