Multi-level network based on transformer encoder for fine-grained image–text matching
Lei Yang, Yong Feng, Mingliang Zhou, Xiancai Xiong, Yongheng Wang, Baohua Qiang
DOI: https://doi.org/10.1007/s00530-023-01079-w
IF: 3.9
2023-04-12
Multimedia Systems
Abstract: Image–text matching is important for understanding both vision and language. Existing methods use the cross-attention mechanism to explore deep semantic information. However, most of these methods need to perform two types of alignment, which is extremely time-consuming. In addition, current methods do not consider the digital information within the image or text, which may reduce retrieval performance. In this paper, we propose a multi-level network based on the transformer encoder for fine-grained image–text matching. First, we use the transformer encoder to extract intra-modality relations within the image and text, and we perform the alignment through an efficient aggregation method, which speeds up alignment and makes full use of intra-modality information. Second, we capture the discriminative digital information within the image and text to make the representation more distinguishable. Finally, we use the global information of the image and text as complementary information to enhance the representation. Experimental results show significant improvements in retrieval performance and runtime over state-of-the-art algorithms. The source code is available at https://github.com/CQULab/MNTE.
computer science, information systems, theory & methods
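
Illustrative sketch (not the authors' implementation): the abstract does not specify the aggregation method, so the example below assumes mean pooling followed by cosine similarity, purely to show why aggregating each modality into a single vector makes matching cheap. The function names (mean_pool, cosine, similarity_matrix) and the toy feature values are hypothetical; in the actual method the per-region and per-word features would come from transformer encoders.

```python
import math

def mean_pool(vectors):
    """Aggregate equal-length feature vectors into one vector by averaging.
    ASSUMPTION: mean pooling stands in for the paper's unspecified aggregator."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(image_feats, text_feats):
    """Pool every image and every text once, then compare pooled vectors.
    Each image-text pair costs one dot product instead of a token-level
    attention pass over all region-word pairs."""
    pooled_images = [mean_pool(f) for f in image_feats]
    pooled_texts = [mean_pool(f) for f in text_feats]
    return [[cosine(i, t) for t in pooled_texts] for i in pooled_images]

# Toy data (hypothetical): 2 images with 2 region features each,
# 2 texts with 3 and 1 word features respectively, all 2-dimensional.
images = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.1, 0.9]],
]
texts = [
    [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0]],
]

scores = similarity_matrix(images, texts)
# Index of the highest-scoring text for each image.
best_text_per_image = [max(range(len(row)), key=row.__getitem__) for row in scores]
```

Because each item is pooled once, the per-item vectors can be precomputed and cached, so retrieval over a gallery reduces to a matrix of dot products.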