MM-Rec: Visiolinguistic Model Empowered Multimodal News Recommendation

Chuhan Wu,Fangzhao Wu,Tao Qi,Chao Zhang,Yongfeng Huang,Tong Xu
DOI: https://doi.org/10.1145/3477495.3531896
2022-01-01
Abstract:News representation is critical for news recommendation. Most existing methods learn news representations only from news texts while ignoring the visual information of news. In fact, users may click news not only due to the interest in news titles but also the attraction of news images. Thus, images are useful for representing news and predicting news clicks. Pretrained visiolinguistic models are powerful in multi-modal understanding, which can represent news from both textual and visual contents. In this paper, we propose a multimodal news recommendation method that can incorporate both textual and visual information of news to learn multimodal news representations. We first extract region-of-interests (ROIs) from news images via object detection. We then use a pre-trained visiolinguistic model to encode both news texts and image ROIs and model their inherent relatedness using co-attentional Transformers. In addition, we propose a crossmodal candidate-aware attention network to select relevant historical clicked news for the accurate modeling of user interest in candidate news. Experiments validate that incorporating multimodal news information can effectively improve the performance of news recommendation.
What problem does this paper attempt to address?