UC: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training (Supplementary Material)

Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu
2021-01-01
Abstract: Multilingual Image-Text Retrieval. During fine-tuning, we train and evaluate the pre-trained UC on Multi30K [4, 3, 1] and MSCOCO [2, 8, 6]. On both datasets, we use a batch size of 40 and sample 2 negative image-text pairs for each sampled positive image-text pair. The pre-trained model is optimized with the Adam optimizer, with the learning rate set to 1e-4 and a linear warm-up over the first 10% of fine-tuning. In the cross-lingual zero-shot setting, the pre-trained UC is fine-tuned on English-only training data for 30K steps. In the all-language setting, we train UC on the training data in all languages for 50K steps. Fine-tuning is run on 8 Nvidia V100 GPUs.
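The learning-rate schedule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `warmup_lr` and the dictionary `FINE_TUNING_STEPS` are hypothetical, and because the text does not say what happens after the warm-up, the sketch assumes the rate simply stays at its peak value of 1e-4.

```python
# Step budgets stated in the text for the two fine-tuning settings.
# The dictionary name and keys are illustrative, not from the paper.
FINE_TUNING_STEPS = {
    "cross_lingual_zero_shot": 30_000,  # English-only training data
    "all_language": 50_000,             # training data in all languages
}


def warmup_lr(step, total_steps, peak_lr=1e-4, warmup_percent=10):
    """Return the learning rate at a 0-indexed training step.

    The rate rises linearly to peak_lr over the first warmup_percent
    of total_steps. ASSUMPTION: after the warm-up the rate is held
    constant at peak_lr; the source does not specify any decay.
    """
    # Integer arithmetic avoids floating-point surprises in the step count.
    warmup_steps = max(1, total_steps * warmup_percent // 100)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


if __name__ == "__main__":
    total = FINE_TUNING_STEPS["cross_lingual_zero_shot"]
    for step in (0, 1_499, 2_999, 3_000, total - 1):
        print(step, warmup_lr(step, total))
```

In a real training loop the returned value would be written into the optimizer's learning rate before each update; a decay phase could replace the constant branch if the original setup used one.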