Abstract: Recently, engagement has emerged as a key variable explaining the success of conversation. From the perspective of human-machine interaction, automatic assessment of engagement is crucial for better understanding the dynamics of an interaction and for designing socially-aware robots. This paper presents a predictive model of the level of engagement in conversations. In particular, it shows the benefit of using a rich multimodal set of features, which outperforms existing models in this domain. In terms of methodology, the study is based on two audio-visual corpora of naturalistic face-to-face interactions. These resources have been enriched with various annotations of verbal and nonverbal behaviors, such as smiles, head nods, and feedback. In addition, we manually annotated gesture intensity. Based on a review of previous work in psychology and human-machine interaction, we propose a new definition of the notion of engagement, suitable for describing this phenomenon in both natural and mediated environments. This definition has been implemented in our annotation scheme. In our work, engagement is studied at the turn level, which is known to be crucial for the organization of conversation. Even though there is still a lack of consensus on the precise definition of a turn, we have developed a turn detection tool. A multimodal characterization of engagement is performed using a multi-level classification of turns. We claim that a set of multimodal cues, involving prosodic, mimo-gestural and morpho-syntactic information, is relevant for characterizing the level of engagement of speakers in conversation. Our results significantly outperform the baseline and reach state-of-the-art level (0.76 weighted F-score). The most contributing modalities are identified by testing the performance of a two-layer perceptron when trained on unimodal feature sets and on combinations of two to four modalities.
These results support our claim about multimodality: combining features related to the speech fundamental frequency and energy with mimo-gestural features leads to the best performance.
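
The evaluation protocol described above (scoring a classifier on each unimodal feature set and on combinations of modalities, using a weighted F-score) can be illustrated with a minimal sketch. It is not the authors' code: the modality names in `MODALITIES` and the helper functions are hypothetical, the split into exactly these four feature groups is an assumption, and the two-layer perceptron itself is left out. The weighted F-score is taken here to mean the per-class F1 averaged with weights proportional to each class's frequency in the reference labels.

```python
from collections import Counter
from itertools import combinations

# Hypothetical feature groups; the abstract does not list the exact split.
MODALITIES = ["f0", "energy", "mimo_gestural", "morpho_syntactic"]


def weighted_f_score(y_true, y_pred):
    """Per-class F1, averaged with weights equal to class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is >= n > 0 here.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


def modality_subsets(modalities, min_size=1, max_size=None):
    """Yield every combination of modalities from min_size to max_size."""
    if max_size is None:
        max_size = len(modalities)
    for k in range(min_size, max_size + 1):
        yield from combinations(modalities, k)


if __name__ == "__main__":
    # Each subset would be used to select feature columns, train the
    # classifier, and score its predictions with weighted_f_score.
    for subset in modality_subsets(MODALITIES):
        print("+".join(subset))
```

With four feature groups this enumerates 15 configurations (4 unimodal and 11 multimodal), each of which would be trained and scored separately before comparing results.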

Multimodal Activation: Awakening Dialog Robots Without Wake Words

Multimodal Human-robot Interaction on Service Robot

A Multimodal Emotional Communication Based Humans-Robots Interaction System

Multimodal fusion-powered English speaking robot

A Multimodal Approach of Generating 3D Human-Like Talking Agent.

Research on Multimodal Human-Robot Interaction Based on Speech and Gesture.

Multimodal interaction enhanced representation learning for video emotion recognition

A multimodal human-robot sign language interaction framework applied in social robots

Continuous Multi-Modal Human Interest Detection for a Domestic Companion Humanoid Robot.

A multimodal approach for modeling engagement in conversation

Multi-modal Human-machine Conversation System for Real Physical World

Incorporating Multimodal Sentiments into Conversational Bots for Service Requirement Elicitation.

User Attention-guided Multimodal Dialog Systems

Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation

Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents

Advantages of Multimodal versus Verbal-Only Robot-to-Human Communication with an Anthropomorphic Robotic Mock Driver

On-device audio-visual multi-person wake word spotting

DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Multimodal Knowledge-enhanced Interactive Network with Mixed Contrastive Learning for Emotion Recognition in Conversation