Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval

Ying Ma, Meng Wang, Guangyun Lu, Yajun Sun
DOI: https://doi.org/10.1007/s00371-024-03496-y
IF: 2.835
2024-06-12
The Visual Computer
Abstract: Cross-modal hashing has attracted widespread attention due to its ability to reduce the complexity of storage and retrieval. However, many existing methods use a symbolic function to map hash codes, which leads to a loss of semantic information when mapping the original features to a low-dimensional space and consequently decreases retrieval accuracy. To address these challenges, we propose a cross-modal hashing method called Multi-Label Semantic Sharing based on Graph Convolutional Network for Image-to-Text Retrieval (MLSS). Specifically, we employ dual transformers to encode multimodal data and utilize CNN to assist in extracting local information from images, thereby enhancing the matching capability between images and text. Additionally, we design a multi-label semantic sharing module based on a graph convolutional network, which learns a unified multi-label classifier and establishes a semantic bridge between the feature representation space and the hashing space for images and text. By leveraging multi-label semantic information to guide feature and hash learning, MLSS generates hash codes that preserve semantic similarity information, leading to a significant improvement in the performance of image-to-text retrieval. Our experiments on three benchmark datasets demonstrate that MLSS outperforms several state-of-the-art cross-modal retrieval methods. Our code can be found at https://github.com/My1new/MLSS.
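The abstract's point about information loss can be illustrated with a small sketch. It assumes that the "symbolic function" refers to the sign function commonly used to binarize features in hashing methods. This is generic sign-based hashing with Hamming-distance ranking, not the MLSS implementation; the function names and feature values are hypothetical and chosen only for illustration.

```python
# Illustrative only: generic sign-based hashing, not the MLSS method.

def sign_hash(features):
    """Binarize a real-valued feature vector into a {-1, +1} hash code."""
    return [1 if x >= 0 else -1 for x in features]


def hamming_distance(code_a, code_b):
    """Count the positions where two equal-length hash codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )


# Hypothetical 4-dimensional embeddings (made-up values, not from the paper).
image_feature = [0.9, -0.1, 0.4, -0.7]
text_features = [
    [0.1, -0.8, 0.05, -0.02],  # very different magnitudes, same signs
    [-0.6, 0.3, 0.5, -0.9],    # differs from the image in the first two signs
]

query = sign_hash(image_feature)
database = [sign_hash(t) for t in text_features]
ranking = rank_by_hamming(query, database)
```

In this sketch the first text vector has magnitudes quite unlike the image vector, yet both map to the same code, so the binarization alone cannot distinguish them. That is the kind of semantic loss the abstract attributes to sign-based mapping.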
computer science, software engineering