Combined query image retrieval based on hybrid coding of CNN and Mix-Transformer

Zhiwei Zhang, Shuli Cheng, Liejun Wang
DOI: https://doi.org/10.1016/j.eswa.2023.121060
IF: 8.5
2023-01-01
Expert Systems with Applications
Abstract: Convolutional Neural Networks (CNNs) are commonly used to extract reference image features in combined query image retrieval algorithms. A CNN extracts image features by computing the relationships between pixels within a small region through a convolution kernel, which covers only a small part of the image at a time. As a result, a CNN focuses on the local information of the image and tends to lose its global information. To enrich the feature information of reference images and improve the retrieval performance of the network model, this paper proposes the following improvements: (1) Multi-Headed Self-Attention (MSA) in Vision Transformer (ViT) attends only to correlations within a sample and ignores correlations between samples. This makes it difficult to obtain a global view of the dataset and limits the ability of ViT to capture reference image information. To solve this problem, we propose a Group External Attention (GEA) module and use it to replace MSA in ViT, yielding Mix-Transformer. We then build a hybrid network of CNN and Mix-Transformer to capture reference image features. (2) We introduce a Shuffle Attention (SA) module that re-aggregates the reference image features extracted by the CNN in groups by channel, so that information can flow between different groups, which enhances the richness and relevance of the reference image feature information. Extensive experiments on the publicly available Fashion200k and MIT-States datasets confirm the outstanding performance of our algorithm.
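To illustrate how grouping features by channel can let information flow between groups, here is a minimal sketch of the generic channel shuffle operation that shuffle-style attention modules commonly rely on. It is not the authors' implementation: the function name `channel_shuffle`, the NCHW tensor layout, and the group count are assumptions made for illustration, and the attention branches of the SA module are omitted.

```python
import numpy as np


def channel_shuffle(x, groups):
    """Interleave channels across groups (hypothetical helper, NCHW layout assumed)."""
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    # Split the channel axis into (groups, channels_per_group).
    x = x.reshape(n, groups, c // groups, h, w)
    # Swap the two axes so consecutive channels come from different groups.
    x = x.transpose(0, 2, 1, 3, 4)
    # Flatten back to the original shape with the new channel order.
    return x.reshape(n, c, h, w)


# Six channels labelled 0..5, split into two groups: [0, 1, 2] and [3, 4, 5].
features = np.arange(6).reshape(1, 6, 1, 1)
shuffled = channel_shuffle(features, groups=2)
print(shuffled[0, :, 0, 0].tolist())  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each contiguous block of channels mixes entries from both original groups, which is the sense in which information can move between groups in a later grouped operation.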