Iterative Uni-modal and Cross-modal Clustered Contrastive Learning for Image-text Retrieval

Yi Zhu, Xiu Li
DOI: https://doi.org/10.1109/PRMVIA58252.2023.00009
2023-01-01
Abstract: Multimedia data has exploded in both quantity and form. Against this background, cross-modal retrieval has become a research hotspot in recent years. We address the image-to-text and text-to-image retrieval problems by proposing a symmetric two-stream pre-training framework. The architecture is based on the CLIP model and consists of a BERT-pretrained text encoder and a Vision Transformer (ViT)-pretrained image encoder. We use not only a cross-modal contrastive loss but also two symmetric uni-modal contrastive losses to train the model in an unsupervised manner. In addition, we propose novel training strategies, including a multi-stage training scheme and an iterative training strategy with clustered hard negative data. Experimental results show that introducing the uni-modal self-supervised branch and losses yields better performance than the CLIP model alone.
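The training objective described in the abstract, one cross-modal contrastive loss plus two symmetric uni-modal contrastive losses, can be illustrated with a minimal, dependency-free sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the InfoNCE form of each loss, the temperature value, the use of a second view of each sample as the uni-modal positive, and the hypothetical names `info_nce`, `combined_loss`, and `uni_weight`. Plain Python lists stand in for encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE over a batch.

    anchors[i] and positives[i] form the positive pair; every other row in
    the batch acts as a negative. The temperature is an assumed value.
    """
    a = [normalize(v) for v in anchors]
    b = [normalize(v) for v in positives]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the anchor-to-positive and positive-to-anchor directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def combined_loss(img, txt, img_view, txt_view, uni_weight=1.0):
    """Cross-modal loss plus two uni-modal losses (assumed weighting)."""
    cross = info_nce(img, txt)            # image <-> text
    uni_img = info_nce(img, img_view)     # image <-> second image view
    uni_txt = info_nce(txt, txt_view)     # text <-> second text view
    return cross + uni_weight * (uni_img + uni_txt)
```

Setting `uni_weight=0.0` reduces the sketch to a CLIP-style cross-modal objective, which matches the comparison the abstract draws between the full model and CLIP alone. The multi-stage schedule and the clustered hard-negative sampling are not modeled here.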