End-to-End Large-Scale Image Retrieval Network with Convolution and Vision Transformers

Qing Zhang, Feilong Bao, Xiangdong Su, Weihua Wang, Guanglai Gao
DOI: https://doi.org/10.1007/978-3-031-15937-4_52
2022-01-01
Abstract: There has been significant progress in content-based image retrieval with the development of convolutional neural networks and vision transformers. However, a semantic gap remains between high-level semantic information and low-level visual features. To address this problem, we propose a high-performance image retrieval method based on the convolutional neural network (CNN) and vision transformers, which takes advantage of the local characteristics of the CNN and the long-range dependency modeling of vision transformers. The proposed convolution and vision transformers network (CVTNet) first uses a CNN backbone network to extract the feature representation of the image. Second, it uses vision transformers to enhance the semantic relationships among the feature layers and thereby reduce the semantic gap. Finally, we propose an adaptive weight loss function that fuses triplet loss and second-order similarity loss to capture more image structure information. Extensive experimental results demonstrate that CVTNet achieves significant performance improvement over the baselines on the Revisited Oxford and Paris datasets.
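To make the fused loss mentioned in the abstract more concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not give the loss formulas or the adaptive weighting rule, so the sketch assumes a standard margin-based triplet loss, a second-order similarity term in the style commonly used in the descriptor-learning literature (comparing anchor-to-anchor distances with positive-to-positive distances), and a plain `weight` argument standing in for the adaptive weight. All function names, the `margin` default, and the `weight` default are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Margin-based triplet loss for one (anchor, positive, negative) triple."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def second_order_similarity_loss(anchors, positives):
    """Assumed second-order term: for each index i, compare the distances
    d(a_i, a_j) and d(p_i, p_j) over all j != i, then average over i."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        squared = 0.0
        for j in range(n):
            if j != i:
                diff = euclidean(anchors[i], anchors[j]) - euclidean(positives[i], positives[j])
                squared += diff ** 2
        total += math.sqrt(squared)
    return total / n


def fused_loss(anchors, positives, negatives, weight=0.5, margin=0.1):
    """Mean triplet loss plus a weighted second-order term.
    `weight` is a fixed stand-in; the paper's adaptive rule is not specified here."""
    triples = zip(anchors, positives, negatives)
    first_order = sum(triplet_loss(a, p, n, margin) for a, p, n in triples) / len(anchors)
    return first_order + weight * second_order_similarity_loss(anchors, positives)
```

In this sketch the first-order term pulls each matching pair together relative to a non-matching image, while the second-order term penalizes batches in which the anchors and their positives have different pairwise distance structure. How the two terms are balanced during training is exactly what the adaptive weighting would decide.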