VCMaster: Generating Diverse and Fluent Live Video Comments Based on Multimodal Contexts

Manman Zhang, Ge Luo, Yuchen Ma, Sheng Li, Zhenxing Qian, Xinpeng Zhang
DOI: https://doi.org/10.1145/3581783.3612078
2023-01-01
Abstract: Live video commenting, or "bullet screen," is a popular form of social interaction on video platforms. Automatic live commenting has been explored as a promising way to make videos more appealing. However, existing methods neglect the diversity of the generated sentences, which limits their ability to produce human-like comments. In this paper, we introduce a novel framework called "VCMaster" for multimodal live video comment generation, which balances the diversity and quality of generated comments to create human-like sentences. We take images, subtitles, and contextual comments as inputs to better understand complex video contexts. We then propose a Hierarchical Cross-Fusion Decoder that integrates high-quality trimodal feature representations by cross-fusing critical information from previous layers. Additionally, we develop a Sentence-Level Contrastive Loss that uses contrastive learning to enlarge the distance between generated and contextual comments. This loss helps the model avoid the pitfall of simply imitating the provided contextual comments and losing creativity, encouraging it to produce more diverse comments while maintaining high quality. We also construct a large-scale multimodal live video comments dataset with 292,507 comments and three sub-datasets that cover nine general categories. Extensive experiments demonstrate that our model produces human-like language and generates more fluent, diverse, and engaging comments than the baselines.
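The abstract does not give the form of the Sentence-Level Contrastive Loss, so the following is only a rough illustration of the general idea of pushing a generated comment away from its contextual comments in embedding space. It is a margin-based penalty on cosine similarity, not the paper's formulation. The function names, the margin value, and the use of fixed sentence embedding vectors are all assumptions made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def context_repulsion_penalty(generated, contexts, margin=0.5):
    """Hypothetical penalty: average amount by which the generated
    sentence embedding is more similar than `margin` to each contextual
    comment embedding. Zero once every similarity is at or below `margin`.
    """
    if not contexts:
        return 0.0
    penalties = [max(0.0, cosine(generated, c) - margin) for c in contexts]
    return sum(penalties) / len(penalties)


# A generated comment that copies a contextual comment is penalized,
# while one pointing in an unrelated direction is not.
copied = context_repulsion_penalty([1.0, 0.0], [[1.0, 0.0]])
unrelated = context_repulsion_penalty([1.0, 0.0], [[0.0, 1.0]])
```

In a training setup, a term like this would presumably be added to the usual generation loss, so that quality is still driven by the language modeling objective while the penalty discourages near-copies of the context.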