Internship: probing joint vision-and-language representations

Emmanuelle Salin, Stephane Ayache, Benoit Favre, Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu, Kevin Clark, Urvashi Khandelwal, Omer Levy, Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh
2020-01-01
Abstract: Context. Recent advances in deep learning have enabled exciting applications in multimodal processing involving images and text, such as visual question answering [1], visual dialog [4], image captioning [14], and text understanding in a multimodal context [5]. This internship focuses on exploring the representations trained to perform such tasks. Vision-and-language representations are typically extracted with neural networks based on the Transformer architecture, pre-trained with self-supervision on large datasets such as Conceptual Captions [11] or MSCOCO [10]. The resulting family of architectures generally represents objects extracted from the image as embeddings and concatenates them with the word embeddings of the text, before feeding the combined sequence to multiple layers of attention mechanisms [12, 13, 9, 2].
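The input construction described above can be illustrated with a minimal toy sketch. It is not the implementation of any of the cited models: the function names (`softmax`, `self_attention`, `encode`), the embedding size, and the random embeddings are all assumptions made for illustration. Real models add learned query/key/value projections, multiple heads, position and segment embeddings, layer normalization, and feed-forward sublayers, all omitted here.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Simplification (assumption): queries, keys, and values are the input
    vectors themselves, with no learned projection matrices.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this token to every token in the joint sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def encode(word_embs, object_embs, num_layers=2):
    """Concatenate text and image-object embeddings, then apply attention layers."""
    seq = word_embs + object_embs  # one joint sequence over both modalities
    for _ in range(num_layers):
        attended = self_attention(seq)
        # Residual connection: add the attention output to the layer input.
        seq = [[x + y for x, y in zip(tok, att)] for tok, att in zip(seq, attended)]
    return seq


rng = random.Random(0)
dim = 4  # hypothetical embedding size
word_embs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]    # 3 text tokens
object_embs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]  # 2 detected objects
joint = encode(word_embs, object_embs)
print(len(joint), len(joint[0]))  # 5 4
```

Because both modalities share one sequence, every word can attend to every detected object and vice versa in each layer. The output keeps one vector per input token, which is what probing experiments would typically inspect.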