Tri-Modal Transformers with Mixture-of-Modality-Experts for Social Media Prediction
Weilong Chen, Wenhao Hu, Xiaolu Chen, Weimin Yuan, Yan Wang, Yanru Zhang, Zhu Han
DOI: https://doi.org/10.1109/tcsvt.2024.3474101
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: With billions of users worldwide, accurately predicting social media popularity is crucial for assessing user behavior, forecasting trends, and enhancing social interactions and business strategies. However, this task presents two significant challenges. First, extracting valuable insights is complicated by tri-modal data (visual, text, structured) and pervasive noise. Second, knowledge acquired during the pre-training phase often transfers poorly because pre-training objectives differ from the downstream prediction task addressed during fine-tuning. Existing methods for Social Media Popularity Prediction (SMPP), including traditional models and Visual-and-language Models (VLMs), struggle to overcome these challenges and therefore fail to achieve satisfactory accuracy. To tackle these challenges, we propose a novel approach named Tri-Modal Transformers with Mixture-of-Modality-Experts (TTME) for SMPP. TTME integrates Artificial Intelligence Generated Content to mitigate data noise and incorporates a mixture of modality experts in the pre-training phase to effectively utilize tri-modal data. Moreover, to address the disparity between pre-training and fine-tuning, we explore strategies for downstream task adaptation, including the integration of diverse pre-training experts and the implementation of DistillSoftmax. Through empirical evaluation, we demonstrate that TTME significantly improves the accuracy of social media popularity prediction, effectively utilizes noisy tri-modal data, and enhances knowledge transfer from pre-training to downstream tasks.
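The abstract does not describe how the modality experts are wired, so the following is only a minimal sketch of the general idea of giving each modality its own feed-forward expert. It assumes hard routing by a modality tag and a residual connection. The class `ModalityExpertLayer`, the helper functions, and the dimensions are hypothetical names chosen for illustration, not the TTME implementation.

```python
import random

# Assumption: the three modalities named in the abstract.
MODALITIES = ("visual", "text", "structured")


def make_linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def matvec(weights, x):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


class ModalityExpertLayer:
    """Hypothetical layer: one feed-forward expert per modality.

    Each token carries a modality tag and is sent to the expert for
    that tag (hard routing). This is an illustrative assumption; the
    abstract does not specify the routing rule.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.experts = {
            m: (make_linear(dim, hidden, rng), make_linear(hidden, dim, rng))
            for m in MODALITIES
        }

    def forward(self, tokens):
        """tokens: list of (modality, vector) pairs; returns a list of vectors."""
        outputs = []
        for modality, x in tokens:
            w_in, w_out = self.experts[modality]  # KeyError on unknown tags
            y = matvec(w_out, relu(matvec(w_in, x)))
            outputs.append([a + b for a, b in zip(x, y)])  # residual add
        return outputs


def mean_pool(vectors):
    """Average token vectors into a single post-level representation."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

A post would be passed in as a list of tagged token vectors, for example `layer.forward([("visual", v), ("text", t), ("structured", s)])`, and the pooled output could then feed a regression head that predicts the popularity score.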