Engaging Live Video Comments Generation
Ge Luo,Yuchen Ma,Manman Zhang,Junqiang Huang,Sheng Li,Zhenxing Qian,Xinpeng Zhang
DOI: https://doi.org/10.1145/3664647.3681195
2024-01-01
Abstract:Multimodal Dialogue agents are often required to respond to conversation history using both textual and visual content. Even though current dialogue studies predominantly strive to generate natural texts or images, they fall short in considering the relevance of multimodal responses within a dialogue context, consequently confining agents from making prudent choices based on multiple alternatives and their associated relevance scores for decision-making. In this paper, we present a bidirectional multimodal dialogue framework that skillfully combines the forward generation of multiple text and image response candidates with reverse selection guided by relevance scores evaluated on dialogue context, facilitating agents in selecting the most suitable multimodal responses. Specifically, the forward generation aspect of our framework leverages a stage-wise approach, first producing textual replies and composite visual descriptions from the dialogue context, followed by the generation of visual responses aligned with the descriptions. In the reverse selection process, visual responses are translated into tangible descriptive texts that, in conjunction with textual responses, are inversely tied back to the dialogue context for relevance assessment, assigning a reference score to each multimodal response candidate to assist the intelligent agent in making informed decisions. Experimental outcomes demonstrate that our proposed bidirectional dialogue response framework markedly elevates performance in both automatic and human evaluations, yielding a range of contextually fitting multimodal responses for selection.