Dual-Stream Pre-Training Transformer to Enhance Multimodal Learning for Social Media Prediction

Wenhao Hu, Weilong Chen, Weimin Yuan, Yan Wang, Shimin Cai, Yanru Zhang
DOI: https://doi.org/10.1145/3664647.3688998
2024-01-01
Abstract: Social media has emerged as a vital platform for communication, information sharing, and acquisition. Predictive analysis of social media data has wide applications, such as sentiment examination and social network analysis. However, existing work often trains directly on social media data, neglecting the issue of mismatched text and images. This neglect can lead to confusion about the content, thereby affecting the identification of trending topics and the accuracy of social media predictions. In this paper, an approach named Dual-Stream Pre-training Transformer (DSPT) is introduced to address this gap. In DSPT, we use a Visual-Language Model (VLM) and a Language Model (LM) to learn separately from image and text data, mitigating the impact of text-image mismatches. Moreover, to enhance the model's understanding of social media data, we conduct incremental pre-training for both models. To achieve better feature interaction, we construct an integrated regression module that combines LightGBM and CatBoost and jointly predicts from the extracted feature embeddings. This dual-stream multimodal feature extraction method improves the performance of predictive tasks. Experimental results validate the effectiveness of our approach, demonstrating its potential and providing deeper insights into multimodal data mining in social media.
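The dual-stream idea in the abstract can be illustrated with a minimal sketch: two embeddings (one per stream) are fused into a single feature vector, and two regressors predict from it jointly. The abstract does not state how the embeddings are fused or how the LightGBM and CatBoost outputs are combined, so the concatenation and the weighted average below are assumptions, and `fuse_embeddings`, `blend`, `model_a`, and `model_b` are hypothetical names. The two toy functions stand in for fitted regressors so that the example runs without external libraries.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Regressor = Callable[[Vector], float]


def fuse_embeddings(vlm_emb: Sequence[float], lm_emb: Sequence[float]) -> Vector:
    """Concatenate the VLM-stream and LM-stream embeddings (assumed fusion)."""
    return list(vlm_emb) + list(lm_emb)


def blend(regressors: Sequence[Regressor],
          weights: Sequence[float],
          features: Vector) -> float:
    """Weighted average of each regressor's prediction (assumed combination)."""
    if len(regressors) != len(weights):
        raise ValueError("need exactly one weight per regressor")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * r(features) for r, w in zip(regressors, weights)) / total


# Hypothetical stand-ins for two fitted gradient-boosting regressors.
def model_a(x: Vector) -> float:
    return sum(x)


def model_b(x: Vector) -> float:
    return max(x)


features = fuse_embeddings([0.5, 1.5], [2.0])
score = blend([model_a, model_b], [0.5, 0.5], features)
```

In a real pipeline, the stand-ins would presumably be replaced by the `predict` methods of fitted `lightgbm.LGBMRegressor` and `catboost.CatBoostRegressor` models, with the blending weights tuned on validation data.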