Abstract: Making a talking avatar sensitive to user behaviors enhances the user experience in human-computer interaction (HCI). In this study, we combine the user's multi-modal behaviors with the historical information of those behaviors in dialog management (DM) to improve the avatar's sensitivity not only to explicit user behavior (speech commands) but also to the user's supporting expressions (emotion, gesture, etc.). In the dialog management, we divide the user's multi-modal behaviors into three categories according to how facial expression, gesture and head motion contribute to speech comprehension: complementation, conflict and independence. The behavior category is first obtained automatically from a short-term and time-dynamic (STTD) fusion model with audio-visual input. Different behavior categories lead to different avatar responses in later dialog turns. Conflict behavior usually reflects an ambiguous user intention (for example, the user says "no" while smiling). In this case, a trial-and-error scheme is adopted to eliminate the ambiguity in the conversation. For the subsequent dialog process, we divide the avatar's dialog states into four types: "Ask", "Answer", "Chat" and "Forget". When complementation or independence behaviors are detected, the user's supporting expressions and explicit behavior are treated as triggers for topic maintenance or transfer among the four dialog states. In the first part of the experiments, we discuss the reliability of the STTD model for user behavior classification. Based on the proposed dialog management and STTD model, we then construct a driving route information query system by connecting the user behavior sensitive dialog management (BSDM) to a 3D talking avatar. Practical conversation records between the avatar and different users show that the BSDM enables the avatar to understand and respond to users' facial expressions, emotional voice and gestures, which improves the user experience in multi-modal human-computer conversation.
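
The control flow described in the abstract can be illustrated with a small sketch. The three behavior categories and the four dialog states are taken from the abstract; everything else is an assumption made for illustration only. In particular, `classify_behavior` is a hypothetical rule-based stand-in for the STTD fusion model (which the abstract describes as operating on audio-visual input), the polarity encoding is invented, and the transition rules in `next_state` are one plausible reading, not the authors' method.

```python
from enum import Enum


class Behavior(Enum):
    # The three behavior categories named in the abstract.
    COMPLEMENTATION = "complementation"
    CONFLICT = "conflict"
    INDEPENDENCE = "independence"


class DialogState(Enum):
    # The four avatar dialog states named in the abstract.
    ASK = "Ask"
    ANSWER = "Answer"
    CHAT = "Chat"
    FORGET = "Forget"


def classify_behavior(speech_polarity, expression_polarity):
    """Hypothetical stand-in for the STTD fusion model.

    Both arguments use an assumed encoding: +1 (positive), -1 (negative),
    0 (neutral or no cue). The real model works on audio-visual features.
    """
    if speech_polarity == 0 or expression_polarity == 0:
        # One channel carries no cue, so the channels do not interact.
        return Behavior.INDEPENDENCE
    if speech_polarity == expression_polarity:
        # The expression supports what was said.
        return Behavior.COMPLEMENTATION
    # The expression contradicts what was said.
    return Behavior.CONFLICT


def next_state(behavior, speech_polarity):
    """Assumed transition rules; the abstract does not specify them."""
    if behavior is Behavior.CONFLICT:
        # Trial-and-error: ask a clarifying question to resolve ambiguity.
        return DialogState.ASK
    if speech_polarity > 0:
        return DialogState.ANSWER  # maintain the current topic
    if speech_polarity < 0:
        return DialogState.FORGET  # drop the current topic
    return DialogState.CHAT  # no explicit command, keep chatting


# The example from the abstract: the user says "no" while smiling.
behavior = classify_behavior(speech_polarity=-1, expression_polarity=+1)
print(behavior.value, "->", next_state(behavior, -1).value)
```

In this sketch the conflict case always routes to "Ask", mirroring the trial-and-error idea of resolving ambiguity before acting; a real system would also use the history of earlier turns, which is omitted here.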

User Attention-guided Multimodal Dialog Systems

Knowledge-aware Multimodal Dialogue Systems

Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements

A non-hierarchical attention network with modality dropout for textual response generation in multimodal dialogue systems

Enhancing Product Representation with Multi-form Interactions for Multimodal Conversational Recommendation

Towards Building Large Scale Multimodal Domain-Aware Conversation Systems

HAIN: Multi-label Classification with Hierarchical Attention-based Interaction Network for Multi-turn Dialogue Texts

Neural Multimodal Belief Tracker with Adaptive Attention for Dialogue Systems

Multi-modal multi-hop interaction network for dialogue response generation

Multimodal Dialogue Generation Based on Transformer and Collaborative Attention

S3: A Simple Strong Sample-effective Multimodal Dialog System

Dual Semantic Knowledge Composed Multimodal Dialog Systems

Multi-View Attention Network for Visual Dialog

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

User Behavior Fusion in Dialog Management with Multi-Modal History Cues

Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

Unveiling the Impact of Multi-Modal Interactions on User Engagement: A Comprehensive Evaluation in AI-driven Conversations

Entropy-Enhanced Multimodal Attention Model for Scene-Aware Dialogue Generation

DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Structure-Aware Multimodal Sequential Learning for Visual Dialog