Abstract: The intelligent dialogue system, which aims to communicate with humans harmoniously in natural language, is valuable for advancing human-machine interaction in the era of artificial intelligence. As human-computer interaction requirements grow increasingly complex (e.g., multimodal inputs, time sensitivity), it is difficult for traditional text-based dialogue systems to meet the demand for more vivid and convenient interaction. Consequently, the Visual Context Augmented Dialogue System (VAD), which has the potential to communicate with humans by perceiving and understanding multimodal information (i.e., visual context in images or videos, textual dialogue history), has become a predominant research paradigm. Benefiting from the consistency and complementarity between visual and textual context, VAD possesses the potential to generate engaging and context-aware responses. To depict the development of VAD, we first characterize its concepts and unique features, and then present its generic system architecture to illustrate the system workflow. Subsequently, several research challenges and representative works are investigated in detail, followed by a summary of authoritative benchmarks. We conclude this paper by putting forward some open issues and promising research trends for VAD, e.g., the cognitive mechanisms of human-machine dialogue under cross-modal dialogue context, and knowledge-enhanced cross-modal semantic interaction.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores how to achieve harmonious interaction between humans and machines through a Visual-Context Augmented Dialogue System (VAD). Specifically, it focuses on the following major issues:

1. **Limitations of Traditional Text-Based Dialogue Systems**:
   - Traditional text-based dialogue systems struggle to meet increasingly complex interaction requirements (such as multimodal inputs and time sensitivity), and so fail to provide a more vivid and convenient interactive experience.
   - These systems are inadequate at handling multimodal information (such as visual context), leading to responses that may be inaccurate or lack contextual awareness.
2. **Potential of Visual-Context Augmented Dialogue Systems**:
   - VAD systems can engage in more natural and harmonious interactions with humans by perceiving and understanding multimodal information (including visual context from images or videos and textual dialogue history).
   - By exploiting the consistency and complementarity of visual and textual contexts, VAD systems have the potential to generate engaging and context-aware responses.
3. **Key Challenges of VAD Systems**:
   - **Efficient Video Processing and Understanding**: VAD needs to extract semantic features of visual scenes from static images or dynamic videos. Videos contain multi-level spatiotemporal structures, which traditional 2D Convolutional Neural Networks (CNNs) cannot handle effectively. Standard 3D CNNs can capture fine-grained spatiotemporal features but impose significant computational and storage burdens, which hurts the efficiency of real-time human-machine interaction.
   - **Cross-Modal Semantic Relationship Reasoning**: Because data from different modalities are heterogeneous, there is a significant semantic gap between the visual and language feature spaces. The semantic interaction between dialogue history context and visual information changes dynamically, especially in dynamic videos with spatiotemporal variations. Understanding and reasoning about these semantic associations is crucial for accurately responding to dialogue queries.
   - **Visual Coreference Phenomenon**: In dialogues, pronouns or abbreviations are often used to refer to previously mentioned language concepts or visual objects. Accurately associating these references with visual targets is another challenge in achieving complex visual and language reasoning.
   - **Reasonable Evaluation Metrics**: Traditional evaluation metrics for text-based dialogue systems cannot measure whether the dialogue agent truly understands the visual information in images or videos. Developing reasonable metrics for evaluating VAD quality remains an open problem.
4. **Research Trends and Development Directions**:
   - The paper also discusses open questions and promising research directions, such as the cognitive mechanisms of human-machine dialogue in cross-modal dialogue contexts and knowledge-enhanced cross-modal semantic interaction, to promote the development of related research.

In summary, this paper aims to stimulate research interest in this emerging field and to guide future research through a comprehensive summary of the conceptual model, system architecture, research challenges, and representative works of VAD systems.
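
The computational burden of 3D CNNs mentioned under "Efficient Video Processing and Understanding" can be made concrete with a rough cost estimate. The sketch below is only illustrative: the `ConvLayer` class, the layer sizes (64 channels, 3x3 kernels), and the clip size (16 frames of 56x56) are hypothetical choices, not values from the paper, and it assumes stride 1 with "same" padding so that the output keeps the input's frame count and resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical single convolution layer, used only for cost counting."""

    in_channels: int
    out_channels: int
    kernel: int               # spatial kernel size (kernel x kernel)
    temporal_kernel: int = 1  # 1 models a 2D conv applied frame by frame

    def params(self) -> int:
        # One weight per (input channel, output channel, kernel position),
        # plus one bias per output channel.
        weights = (self.in_channels * self.out_channels
                   * self.kernel * self.kernel * self.temporal_kernel)
        return weights + self.out_channels

    def macs(self, frames: int, height: int, width: int) -> int:
        # Multiply-accumulate operations, assuming stride 1 and "same"
        # padding (output has the same frames x height x width as input).
        per_output_value = (self.in_channels
                            * self.kernel * self.kernel * self.temporal_kernel)
        return per_output_value * self.out_channels * frames * height * width


# Assumed example sizes, not taken from the paper.
conv2d = ConvLayer(in_channels=64, out_channels=64, kernel=3)
conv3d = ConvLayer(in_channels=64, out_channels=64, kernel=3, temporal_kernel=3)

clip = dict(frames=16, height=56, width=56)
print("2D params:", conv2d.params(), "MACs:", conv2d.macs(**clip))
print("3D params:", conv3d.params(), "MACs:", conv3d.macs(**clip))
print("MAC ratio 3D/2D:", conv3d.macs(**clip) / conv2d.macs(**clip))
```

Under these assumptions, the 3D layer costs exactly `temporal_kernel` times as many multiply-accumulates as the per-frame 2D layer, and the overhead compounds across every layer of a deep network. The per-frame 2D layer is cheaper, but it mixes no information across frames, which is the trade-off the summary describes.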

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review

Teaching Machines to Converse

ViDA-MAN: Visual Dialog with Digital Humans

Towards Enhanced Context Awareness with Vision-based Multimodal Interfaces

A New Mmwave-Speech Multimodal Speech System for Voice User Interface

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

Hybrid Graph Reasoning with Dynamic Interaction for Visual Dialog

Building Goal-Oriented Dialogue Systems with Situated Visual Context

Human-Computer Interaction System: A Survey of Talking-Head Generation

User Behavior Fusion in Dialog Management with Multi-Modal History Cues

Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

Challenges in Building Intelligent Open-domain Dialog Systems

Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

CHAT: a Conversational Helper for Automotive Tasks

Enhancing Augmented Reality Dialogue Systems with Multi-Modal Referential Information

PCDialogEval: Persona and Context Aware Emotional Dialogue Evaluation

Engaging Live Video Comments Generation

Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

Enhancing machine vision: the impact of a novel innovative technology on video question-answering

Digital twin improved via visual question answering for vision-language interactive mode in human–machine collaboration

Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models