Abstract: Engagement has recently emerged as a key variable explaining the success of a conversation. From the perspective of human-machine interaction, automatic assessment of engagement is crucial for better understanding the dynamics of an interaction and for designing socially aware robots. This paper presents a predictive model of the level of engagement in conversations. In particular, it shows the value of a rich multimodal feature set, which outperforms existing models in this domain. Methodologically, the study is based on two audio-visual corpora of naturalistic face-to-face interactions. These resources have been enriched with various annotations of verbal and nonverbal behaviors, such as smiles, head nods, and feedback. In addition, we manually annotated gesture intensity. Based on a review of previous work in psychology and human-machine interaction, we propose a new definition of engagement that is suited to describing this phenomenon in both natural and mediated environments. This definition has been implemented in our annotation scheme. In our work, engagement is studied at the level of the turn, a unit known to be crucial for the organization of conversation. Although there is still no consensus on the precise definition of a turn, we have developed a turn detection tool. A multimodal characterization of engagement is performed using a multi-level classification of turns. We claim that a set of multimodal cues, involving prosodic, mimo-gestural, and morpho-syntactic information, is relevant for characterizing speakers' level of engagement in conversation. Our results significantly outperform the baseline and reach state-of-the-art level (0.76 weighted F-score). The modalities that contribute most are identified by testing the performance of a two-layer perceptron trained on unimodal feature sets and on combinations of two to four modalities. These results support our claim about multimodality: combining features related to the fundamental frequency and energy of speech with mimo-gestural features leads to the best performance.
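The evaluation protocol described in the abstract (scoring unimodal feature sets and combinations of two to four modalities with a weighted F-score) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the label values, and the fourth modality name `other` are hypothetical placeholders, and the classifier itself is omitted. The sketch only shows how the modality subsets could be enumerated and how a support-weighted F-score is computed.

```python
from collections import Counter
from itertools import combinations

# Three modality names come from the abstract; "other" is a hypothetical
# placeholder, since the abstract does not name a fourth modality.
MODALITIES = ["prosodic", "mimo_gestural", "morpho_syntactic", "other"]


def modality_subsets(modalities, min_size=1, max_size=4):
    """Yield every subset of `modalities` with min_size..max_size members."""
    upper = min(max_size, len(modalities))
    for size in range(min_size, upper + 1):
        yield from combinations(modalities, size)


def weighted_f_score(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(
            1 for t, p in zip(y_true, y_pred) if t == label and p == label
        )
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the paper.
    y_true = ["high", "high", "low", "low"]
    y_pred = ["high", "low", "low", "low"]
    print(len(list(modality_subsets(MODALITIES))), "modality subsets")
    print(round(weighted_f_score(y_true, y_pred), 3))
```

In a full experiment, each subset returned by `modality_subsets` would select the feature columns used to train and evaluate one classifier, and the resulting weighted F-scores would be compared across subsets.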

DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation

DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation

Engagement Detection in Online Learning Based on Pre-trained Vision Transformer and Temporal Convolutional Network

A multimodal approach for modeling engagement in conversation

AIMDiT: Modality Augmentation and Interaction via Multimodal Dimension Transformation for Emotion Recognition in Conversations

Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation

Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation

Gated Multimodal Fusion with Contrastive Learning for Turn-taking Prediction in Human-robot Dialogue

Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

Multi-Modal Adaptive Fusion Transformer Network for the Estimation of Depression Level

Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems

Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

DialogueTRM: Exploring the Intra- and Inter-Modal Emotional Behaviors in the Conversation

Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder

Multimodal Activation: Awakening Dialog Robots Without Wake Words

DialogueTRM: Exploring Multi-Modal Emotional Dynamics in a Conversation

MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation

AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts

Multi-scale Conformer Fusion Network for Multi-participant Behavior Analysis

Fusing Multi-Level Features from Audio and Contextual Sentence Embedding from Text for Interview-Based Depression Detection

Multimodal Feature Extraction and Fusion for Emotional Reaction Intensity Estimation and Expression Classification in Videos with Transformers