Title-and-Tag Contrastive Vision-and-Language Transformer for Social Media Popularity Prediction

Weilong Chen, Chenghao Huang, Weimin Yuan, Xiaolu Chen, Wenhao Hu, Xinran Zhang, Yanru Zhang
DOI: https://doi.org/10.1145/3503161.3551568
2022-01-01
Abstract: Social media is an indispensable part of modern life, and social media popularity prediction (SMPP) plays a vital role in practice. Existing work has paid little attention to issues such as the inconsistency between the words used in labels and titles and the transformation of user features. In this paper, we propose a novel approach named Title-and-Tag Contrastive Vision-and-Language Transformer (TTC-VLT), which combines two pre-trained vision-and-language transformers with two additional dense feature components for this prediction task. On the one hand, to learn the differences between titles and tags, we design title-tag contrastive learning over title-visual and tag-visual pairs, which extracts multimodal information separately from the two types of text. On the other hand, user identification features are transformed into embedding vectors to capture user attribute details. In extensive experiments, our approach outperforms the other methods on the social media prediction dataset. Our team achieved 2nd place on the leaderboard of the Social Media Prediction Challenge 2022.
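The step of turning user identification features into embedding vectors can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the embedding dimension, the random initialization, and the reserved index 0 for unseen users are all assumptions made for illustration, and in a trained model the table entries would be learned parameters rather than fixed random values.

```python
import random


def index_user_ids(raw_ids):
    """Map raw user identifiers to contiguous indices (0 is reserved for unseen users)."""
    vocab = {}
    for uid in raw_ids:
        if uid not in vocab:
            vocab[uid] = len(vocab) + 1
    return vocab


def build_embedding_table(num_rows, dim, seed=0):
    """Create a lookup table of small random vectors, one row per index.

    Assumption: random initialization stands in for learned weights.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed_user(uid, vocab, table):
    """Look up the embedding vector for a user, falling back to row 0 if unseen."""
    return table[vocab.get(uid, 0)]


def build_features(uid, dense_features, vocab, table):
    """Concatenate the user embedding with other dense (numeric) features."""
    return embed_user(uid, vocab, table) + list(dense_features)


# Hypothetical example data: three distinct users, two dense features per post.
raw_ids = ["user_a", "user_b", "user_a", "user_c"]
vocab = index_user_ids(raw_ids)
table = build_embedding_table(num_rows=len(vocab) + 1, dim=4)
features = build_features("user_b", [0.5, 12.0], vocab, table)
```

Here `features` holds a 4-dimensional user vector followed by the two dense values, which is the kind of input a downstream regressor could consume alongside the multimodal representations.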