Abstract: Multimodal dialogue agents are often required to respond to a conversation history with both textual and visual content. Although current dialogue studies focus predominantly on generating natural text or images, they rarely consider how relevant a multimodal response is to the dialogue context, which prevents agents from choosing among multiple alternatives on the basis of their relevance scores. In this paper, we present a bidirectional multimodal dialogue framework that combines forward generation of multiple text and image response candidates with reverse selection guided by relevance scores evaluated against the dialogue context, helping agents select the most suitable multimodal responses. Specifically, forward generation proceeds in stages: the framework first produces textual replies and composite visual descriptions from the dialogue context, then generates visual responses aligned with those descriptions. In reverse selection, visual responses are translated into concrete descriptive texts that, together with the textual responses, are tied back to the dialogue context for relevance assessment; each multimodal response candidate is assigned a reference score to help the agent make informed decisions. Experimental results show that the proposed bidirectional dialogue response framework markedly improves performance in both automatic and human evaluations, yielding a range of contextually fitting multimodal responses for selection.
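
The two-pass flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the `Candidate` type, the `propose`, `render`, and `caption` callables (stand-ins for the text generator, image generator, and image captioner), and the token-overlap relevance function, which is only a toy substitute for whatever learned relevance model the framework uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    text: str           # textual reply from the forward stage
    description: str    # visual description produced alongside the reply
    image: str = ""     # stand-in for the generated image (here just an id)
    caption: str = ""   # descriptive text recovered from the image
    score: float = 0.0  # relevance of (text + caption) to the context


def forward_generate(
    context: str,
    propose: Callable[[str, int], Tuple[str, str]],  # hypothetical generator
    render: Callable[[str], str],                    # hypothetical image model
    n: int = 2,
) -> List[Candidate]:
    """Stage-wise forward pass: text and description first, then the image."""
    candidates = []
    for i in range(n):
        text, description = propose(context, i)
        candidates.append(Candidate(text, description, image=render(description)))
    return candidates


def token_overlap(context: str, response: str) -> float:
    """Toy relevance: fraction of response tokens that occur in the context."""
    ctx = set(context.lower().split())
    tokens = response.lower().split()
    return sum(t in ctx for t in tokens) / len(tokens) if tokens else 0.0


def reverse_select(
    context: str,
    candidates: List[Candidate],
    caption: Callable[[str], str],                   # hypothetical captioner
    relevance: Callable[[str, str], float] = token_overlap,
) -> List[Candidate]:
    """Reverse pass: caption each image, score against context, sort by score."""
    for c in candidates:
        c.caption = caption(c.image)
        c.score = relevance(context, f"{c.text} {c.caption}")
    return sorted(candidates, key=lambda c: c.score, reverse=True)


# Toy stand-ins so the sketch runs without any model.
REPLIES = [
    ("nice weather today", "a red sports car"),
    ("a puppy photo please", "a puppy on grass"),
]
context = "i adopted a puppy yesterday want to see a photo"
candidates = forward_generate(
    context,
    propose=lambda ctx, i: REPLIES[i],
    render=lambda desc: "img:" + desc,
    n=len(REPLIES),
)
ranked = reverse_select(context, candidates, caption=lambda img: img[4:])
for c in ranked:
    print(f"{c.score:.3f}  {c.text!r} + {c.caption!r}")
```

In this toy setup the puppy-related candidate ranks first because its text and recovered caption share more tokens with the context. A real system would replace each callable with a trained model.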

Towards Situated Dialogue: Revisiting Referring Expression Generation

Modeling Collaborative Referring for Situated Referential Grounding

Perspective-Corrected Spatial Referring Expression Generation for Human-Robot Interaction

Towards Mediating Shared Perceptual Basis in Situated Dialogue

Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding

Grounding Language in Multi-Perspective Referential Communication

Whether you can locate or not? Interactive Referring Expression Generation

Reference-Centric Models for Grounded Collaborative Dialogue

Ambiguities in Spatial Language Understanding in Situated Human Robot Dialogue

Speaking Your Language: Spatial Relationships in Interpretable Emergent Communication

Referring to the recently seen: reference and perceptual memory in situated dialog

Awareness of Partner’s Eye Gaze in Situated Referential Grounding: An Empirical Study

Position-Aware Attention Mechanism–Based Bi-graph for Dialogue Relation Extraction

Integrating Word Acquisition and Referential Grounding Towards Physical World Interaction

Regularizing Dialogue Generation by Imitating Implicit Scenarios

Reference Resolution and Context Change in Multimodal Situated Dialogue for Exploring Data Visualizations

Forward Creation, Reverse Selection: Achieving Highly Pertinent Multimodal Responses in Dialogue Contexts

SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues

A Unified Mutual Supervision Framework for Referring Expression Segmentation and Generation

Engaging Live Video Comments Generation