Enhancing Sentence Representation with Visually-supervised Multimodal Pre-training

Zhe Li, Laurence T. Yang, Xin Nie, Bocheng Ren, Xianjun Deng
DOI: https://doi.org/10.1145/3581783.3612254
2023-01-01
Abstract: Large-scale pre-trained language models have garnered significant attention in recent years due to their effectiveness in extracting sentence representations. However, most pre-trained models currently use a transformer-based encoder with a single modality and are primarily designed for specific tasks such as natural language inference and question answering. This approach neglects the complementary information provided by multimodal data, which can enhance the effectiveness of sentence representation. To address this issue, we propose a Visually-supervised Pre-trained Multimodal Model (ViP) for sentence representation. Our model leverages diverse label-free multimodal proxy tasks to embed visual information into language, facilitating effective modality alignment and complementarity exploration. Additionally, our model utilizes a novel approach to distinguish highly similar negative and positive samples. We conduct comprehensive downstream experiments on natural language understanding and sentiment classification, demonstrating that ViP outperforms both existing unimodal and multimodal pre-trained models. Our contributions include a novel approach to multimodal pre-training and a state-of-the-art model for sentence representation that incorporates visual information. Our code is available at https://github.com/gentlefress/ViP