Abstract: This study investigated how a conversational agent whose synthetic voice is trained on spontaneous speech affects human interactants. Specifically, we hypothesized that humans exhibit more social responses when interacting with a conversational agent whose synthetic voice is built on spontaneous speech. Typically, speech synthesizers are built on a speech corpus in which voice professionals read a set of written sentences. The resulting synthesized speech is clear, as if a newscaster were reading the news or a voice actor were playing an anime character. However, it is quite different from the spontaneous speech we use in everyday conversation. Recent advances in speech synthesis have enabled us to build a speech synthesizer on a spontaneous speech corpus and to obtain near-conversational synthesized speech of reasonable quality. Using this technology, we examined whether humans produce more social responses to a spontaneously speaking conversational agent. We conducted a large-scale conversation experiment with a conversational agent whose utterances were synthesized with a model trained on either spontaneous speech or read speech. The results showed that subjects who interacted with the agent whose utterances were synthesized from spontaneous speech tended to show shorter response times and a larger number of backchannels. The questionnaire results showed that these subjects also tended to rate their conversation with the agent as closer to a human conversation. These results suggest that speech synthesis built on spontaneous speech is essential to realizing a conversational agent as a social actor.

STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent

CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center

Building a Dialogue Corpus Annotated with Expressed and Experienced Emotions

JSUT and JVS: Free Japanese voice corpora for accelerating speech synthesis research

J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

Building speech corpus with diverse voice characteristics for its prompt-based representation

JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

Towards human-like spoken dialogue generation between AI agents from written dialogue

EMNS /Imz/ Corpus: An emotive single-speaker dataset for narrative storytelling in games, television and graphic novels

JVNV: A Corpus of Japanese Emotional Speech With Verbal Content and Nonverbal Expressions

Text-driven Visual Prosody Generation for Embodied Conversational Agents

How does a spontaneously speaking conversational agent affect user behavior?

EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations

SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis

Prevalence and future prediction of type 2 diabetes mellitus in the Kingdom of Saudi Arabia: A systematic review of published studies.

Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

Toward a Dialogue System Using a Large Language Model to Recognize User Emotions with a Camera

Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Data-Driven Dialogue Systems for Social Agents