Abstract: Predicting when it is an artificial agent's turn to speak or act during human-agent interaction (HAI) is challenging: the predictor must run in real time, remain sensitive to context, capture complex human behavior, integrate multiple modalities effectively, and cope with class imbalance. In this paper, we present a novel deep learning approach for predicting turn-taking events in HAI that leverages information from multiple modalities, including text, audio, vision, and context data. Our study demonstrates that incorporating additional modalities, including in-game context data, gives a more comprehensive picture of interaction dynamics and improves the prediction accuracy of the artificial agent. The model is also efficient enough to permit potential real-time applications. We evaluated the proposed model on an imbalanced dataset of over 125,000 instances, covering both successful and failed turn-taking attempts during an HAI cooperative gameplay scenario, and employed a focal loss function to address the class imbalance. Our model outperformed baseline models such as Early Fusion LSTM (EF-LSTM), Late Fusion LSTM (LF-LSTM), and the state-of-the-art Multimodal Transformer (MulT). Additionally, an ablation study of the individual modality components within our model revealed the significant role of speech content cues. In conclusion, our proposed approach demonstrates considerable potential for predicting turn-taking events within HAI, providing a foundation for future research with physical robots during human-robot interaction (HRI).
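The abstract names a focal loss as the remedy for class imbalance but gives no formulation. As a rough illustration only, the sketch below implements the widely used binary form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), in plain Python. The function names, the binary setting, and the default values `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration; the abstract does not state which variant or hyperparameters the authors used.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for a single example (illustrative sketch, not the paper's code).

    p     : predicted probability of the positive class (e.g. "take the turn")
    y     : ground-truth label, 0 or 1
    gamma : focusing parameter; larger values down-weight easy examples more
    alpha : weight on the positive class; the negative class gets 1 - alpha
    """
    # Clamp the probability so log() never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a batch of (probability, label) pairs."""
    losses = [binary_focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # A confident correct prediction contributes far less than a confident wrong one.
    print(binary_focal_loss(0.9, 1))
    print(binary_focal_loss(0.1, 1))
    print(mean_focal_loss([0.9, 0.1, 0.2], [1, 1, 0]))
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes the effect of the focusing term easy to check in isolation.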

Considering Temporal Connection Between Turns for Conversational Speech Synthesis

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis

Target conversation extraction: Source separation using turn-taking dynamics

Gated Multimodal Fusion with Contrastive Learning for Turn-taking Prediction in Human-robot Dialogue

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Real-Time Multimodal Turn-taking Prediction to Enhance Cooperative Dialogue during Human-Agent Interaction

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic Speech Synthesis

IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities

Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Interactive Conversational Head Generation

High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue

Focusing on attention: prosody transfer and adaptative optimization strategy for multi-speaker end-to-end speech synthesis

Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM

Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models

Duration optimization of speaker adaptation in Mandarin TTS

Improving Deep Neural Network Based Speech Synthesis Through Contextual Feature Parametrization and Multi-Task Learning