COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation

Keyu Wen, Jin Xia, Yuanyuan Huang, Linyang Li, Jiayan Xu, Jie Shao
DOI: https://doi.org/10.1109/iccv48922.2021.00221
2021-01-01
Abstract: There has been a recent surge of interest in cross-modal pre-training. However, existing approaches pre-train a one-stream model to learn a joint vision-language representation, which suffers from an explosion in computation when conducting cross-modal retrieval. In this work, we propose the Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) method to learn universal text-image representations. It has two key designs: a weight-sharing transformer on top of the visual and textual encoders to align text and image semantically, and three kinds of contrastive learning designed for sharing knowledge between different modalities. Cross-modal knowledge sharing greatly promotes the learning of unimodal representations. Experiments on multi-modal matching tasks, including cross-modal retrieval, text matching, and image retrieval, show the effectiveness and efficiency of our pre-training framework. COOKIE fine-tuned on the cross-modal datasets MSCOCO, Flickr30K, and MSRVTT achieves new state-of-the-art results while using only 3/1000 of the inference time compared to one-stream models. There are also 5.7% and 3.9% improvements on the tasks of image retrieval and text matching, respectively. Source code will be available at https://github.com/kywen1119/COOKIE.
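The abstract does not spell out the form of its three contrastive objectives, so the following is only a minimal sketch of a generic symmetric cross-modal contrastive (InfoNCE-style) loss, not the authors' implementation. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration. It shows the basic idea of pulling matched image-text pairs together while pushing mismatched pairs apart.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss, assuming queries[i] is the positive match of keys[i].

    The temperature of 0.07 is an illustrative assumption, not a value
    taken from the abstract.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matched key as the target class.
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_cross_modal_loss(image_emb, text_emb, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE losses."""
    i2t = info_nce(image_emb, text_emb, temperature)
    t2i = info_nce(text_emb, image_emb, temperature)
    return 0.5 * (i2t + t2i)


# Toy embeddings (hypothetical): each image is closest to its own caption.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]
loss = symmetric_cross_modal_loss(images, captions)
```

Because a loss of this kind only needs one embedding per image and one per text, both sets can be encoded independently and compared with dot products, which is the usual reason two-stream models avoid the pairwise cost of one-stream models at retrieval time.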