Abstract: Multimodal dialogue agents are often required to respond to a conversation history with both textual and visual content. Although current dialogue studies predominantly aim to generate natural text or images, they give little consideration to how relevant a multimodal response is to the dialogue context. As a result, agents cannot make a prudent choice among multiple alternatives on the basis of their relevance scores. In this paper, we present a bidirectional multimodal dialogue framework that combines forward generation of multiple text and image response candidates with reverse selection guided by relevance scores evaluated against the dialogue context, helping agents select the most suitable multimodal response. Specifically, forward generation proceeds in stages: it first produces textual replies and composite visual descriptions from the dialogue context, and then generates visual responses aligned with those descriptions. In reverse selection, visual responses are translated into concrete descriptive texts that, together with the textual responses, are tied back to the dialogue context for relevance assessment; each multimodal response candidate is assigned a reference score to help the agent make an informed decision. Experimental results show that the proposed bidirectional framework markedly improves performance in both automatic and human evaluations and yields a range of contextually fitting multimodal responses for selection.
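
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The callables `propose`, `render`, and `caption` are hypothetical stand-ins for the text and description generator, the image generator, and the image-to-text model, and `token_overlap` is a toy scorer substituted for whatever relevance model the framework actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class Candidate:
    text: str               # textual reply
    image_description: str  # description the visual response is generated from
    image: Any = None       # placeholder for the generated visual response
    score: float = 0.0      # score assigned during reverse selection


def token_overlap(context: str, response: str) -> float:
    """Toy scorer (assumption): fraction of response tokens that occur in the context."""
    context_tokens = set(context.lower().split())
    response_tokens = response.lower().split()
    if not response_tokens:
        return 0.0
    hits = sum(tok in context_tokens for tok in response_tokens)
    return hits / len(response_tokens)


def forward_generate(
    context: str,
    propose: Callable[[str], Iterable[Tuple[str, str]]],
    render: Callable[[str], Any],
) -> List[Candidate]:
    """Stage-wise forward pass: (text, description) pairs first, then one image per description."""
    candidates = []
    for text, description in propose(context):
        candidates.append(
            Candidate(text=text, image_description=description, image=render(description))
        )
    return candidates


def reverse_select(
    context: str,
    candidates: List[Candidate],
    caption: Callable[[Any], str],
    scorer: Callable[[str, str], float] = token_overlap,
) -> List[Candidate]:
    """Reverse pass: turn each image back into text, score it against the context, rank."""
    for cand in candidates:
        back_text = f"{cand.text} {caption(cand.image)}"
        cand.score = scorer(context, back_text)
    return sorted(candidates, key=lambda c: c.score, reverse=True)


if __name__ == "__main__":
    context = "do you have a photo of your dog at the beach"
    # Hard-coded proposals replace real generative models in this sketch.
    pairs = [
        ("i like pizza", "a slice of pizza"),
        ("here is my dog at the beach", "a dog at the beach"),
    ]
    cands = forward_generate(context, lambda _ctx: pairs, lambda desc: desc)
    ranked = reverse_select(context, cands, lambda image: image)
    for cand in ranked:
        print(f"{cand.score:.2f}  {cand.text}")
```

Keeping generation and selection as separate functions mirrors the forward and reverse directions named in the abstract: the scorer only sees text, so any image-to-text model can be plugged in through `caption` without changing the ranking step.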

Building Goal-Oriented Dialogue Systems with Situated Visual Context

Learning through Dialogue Interactions by Asking Questions

Planning for Goal-Oriented Dialogue Systems

Modeling Intent, Dialog Policies and Response Adaptation for Goal-Oriented Interactions

What Should I Ask? Using Conversationally Informative Rewards for Goal-Oriented Visual Dialog

Towards Visual Dialogue for Human-Robot Interaction

Context-based Word Acquisition for Situated Dialogue in a Virtual World

Generating Dialogue Agents via Automated Planning

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review

Introducing Brain-like Concepts to Embodied Hand-crafted Dialog Management System

Engaging Live Video Comments Generation

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM

Towards a Progression-Aware Autonomous Dialogue Agent

CHAT: a Conversational Helper for Automotive Tasks

Forward Creation, Reverse Selection: Achieving Highly Pertinent Multimodal Responses in Dialogue Contexts

Task Learning Through Visual Demonstration and Situated Dialogue

Knowledge-aware Multimodal Dialogue Systems

Exploring Context-Aware Conversational Agents in Software Development

ARCADE: An Augmented Reality Display Environment for Multimodal Interaction with Conversational Agents

Data-Driven Dialogue Systems for Social Agents