KB-VLP: Knowledge Based Vision and Language Pretraining

Kezhen Chen, Qiuyuan Huang, Yonatan Bisk, Daniel McDuff, Jianfeng Gao
2021-01-01
Abstract: Transformer-based pretraining techniques have achieved impressive performance in learning cross-modal representations for various multimodal tasks. However, off-the-shelf models do not take advantage of the commonsense knowledge and logical reasoning that are crucial to many real-world tasks. To this end, we introduce a novel pretraining approach, Knowledge Based Vision and Language Pretraining (KB-VLP), which uses knowledge graph embeddings extracted from text and detected image object tags to enhance the learning of semantically aligned and knowledge-aware representations and to improve the model's generalization and interpretability. KB-VLP is pretrained on a large image-text corpus together with automatically extracted knowledge embeddings, and then fine-tuned on several downstream vision-language tasks. Experiments show that KB-VLP significantly improves performance on the VQA, GQA, NLVR, and OKVQA tasks compared with the baselines.