Regularizing Visual Semantic Embedding with Contrastive Learning for Image-Text Matching

Yang Liu, Hong Liu, Huaqiu Wang, Mengyuan Liu
DOI: https://doi.org/10.1109/lsp.2022.3178899
2022-01-01
IEEE Signal Processing Letters
Abstract: Learning visual semantic embedding for image-text matching has achieved great success by using triplet loss to pull together positive image-text pairs, which share similar semantic meaning, and to push apart negative image-text pairs, which differ in semantic meaning. Without modeling constraints from image-image or text-text pairs, the generated visual semantic embedding inevitably faces the problem of semantic misalignments among similar images or among similar texts. To solve this problem, we present a contrastive visual semantic embedding framework, named ConVSE, which achieves intra-modal semantic alignment by contrastive learning from augmented image-image (or text-text) pairs and achieves inter-modal semantic alignment by applying hardest-negative-enhanced triplet loss on image-text pairs. To the best of our knowledge, we are the first to find that contrastive learning benefits visual semantic embedding. Extensive experiments on the large-scale MSCOCO and Flickr30K datasets verify the effectiveness of our proposed ConVSE, which outperforms visual semantic embedding-based methods and achieves new state-of-the-art results.
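The abstract names two loss terms but gives no formulas, so the following is only a minimal sketch of how such an objective might be combined. It assumes a hinge-style triplet loss that keeps only the hardest in-batch negative for the inter-modal term and an InfoNCE-style loss for the intra-modal term. All function names, the margin, the temperature, and the weighting factor are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hardest_negative_triplet_loss(images, texts, margin=0.2):
    """Inter-modal term (assumed form): images[i] matches texts[i].

    For each positive pair, only the most similar non-matching text and the
    most similar non-matching image in the batch are penalized.
    Requires a batch of at least two pairs.
    """
    n = len(images)
    sim = [[cosine(images[i], texts[j]) for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - positive)
        total += max(0.0, margin + hardest_image - positive)
    return total / n


def info_nce_loss(view_a, view_b, temperature=0.1):
    """Intra-modal term (assumed form): view_b[i] is an augmentation of view_a[i].

    Each sample should be closer to its own augmented view than to the
    augmented views of the other samples in the batch.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


def combined_loss(images, images_aug, texts, texts_aug, weight=1.0):
    """Hypothetical combination: inter-modal triplet plus weighted intra-modal terms."""
    inter = hardest_negative_triplet_loss(images, texts)
    intra = info_nce_loss(images, images_aug) + info_nce_loss(texts, texts_aug)
    return inter + weight * intra
```

In this sketch, a batch whose matching image and text embeddings coincide and whose non-matching ones are orthogonal yields a zero triplet term, while swapping the texts so that every pair is mismatched yields a large one. A real training setup would compute these terms on learned encoder outputs rather than on fixed vectors.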