Hierarchical Multi-Modal Attention Network for Time-Sync Comment Video Recommendation

Weihao Zhao, Han Wu, Weidong He, Haoyang Bi, Hao Wang, Chen Zhu, Tong Xu, Enhong Chen
DOI: https://doi.org/10.1109/tcsvt.2023.3309768
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Owing to their inherent interactivity, time-sync comments on videos have attracted increasing attention and are widely adopted on online video platforms. In addition to enhancing user engagement, time-sync comments provide abundant semantic information that can greatly enhance video understanding, yet this information is largely overlooked in mainstream video recommender systems. To address this issue, we propose a Hierarchical Multi-modal Attention Network (HMAN) to effectively utilize time-sync comments for recommendation. Specifically, we design a Multi-level Text Condense (MTC) Module to capture the accurate semantics of time-sync comments via text-level and vision-level condense operations. Then we propose a Range Convolution Block (RCB) to capture both visual and textual information from variable-length event segments by leveraging a variable receptive field. After that, we design a Hierarchical Multi-modal Branch Fusion (HMBF) Module to obtain a comprehensive multi-modal representation of a video with time-sync comments. Finally, recommendation scores are computed as the inner product of the obtained video representation and the user embedding. Extensive experiments demonstrate the effectiveness of the proposed HMAN, and ablation studies on different variants of HMAN further validate the utility of each component and the necessity of the hierarchical multi-modal branch fusion method.
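The final scoring step described in the abstract, an inner product between a video representation and a user embedding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`inner_product`, `rank_videos`), the three-dimensional vectors, and the video identifiers are all hypothetical, and in HMAN the video representation would come from the HMBF module rather than being hand-written.

```python
from typing import Dict, List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_videos(
    user_emb: Sequence[float],
    video_embs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each video by its inner product with the user embedding
    and return (video_id, score) pairs sorted from highest to lowest."""
    scores = {vid: inner_product(user_emb, emb) for vid, emb in video_embs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical embeddings; a trained model would learn these values.
user = [0.5, -1.0, 2.0]
videos = {
    "video_a": [1.0, 0.0, 1.0],
    "video_b": [0.0, 1.0, 0.5],
    "video_c": [2.0, 0.5, 0.0],
}

ranking = rank_videos(user, videos)
for video_id, score in ranking:
    print(video_id, score)
```

A higher score indicates a closer match between the user embedding and the video representation, so the top of the list is what would be recommended first under this scoring rule.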