Abstract: Silent speech interfaces (SSIs) convert non-audio bio-signals, such as articulatory movement, to speech. This technology has the potential to restore speech for individuals who have lost their voice but can still articulate (e.g., laryngectomees). Articulation-to-speech (ATS) synthesis is an SSI algorithm design that offers easy implementation and low latency, and is therefore becoming more popular. Current ATS studies focus on speaker-dependent (SD) models to avoid the large variation in articulatory patterns and acoustic features across speakers. However, these designs are limited by the small amount of data available from individual speakers. Speaker adaptation designs that include data from multiple speakers have the potential to address this limitation; however, few prior studies have investigated their performance in ATS. In this paper, we investigated speaker adaptation on both the input articulatory signals and the output acoustic signals (with or without direct inclusion of data from test speakers) using a publicly available electromagnetic articulography (EMA) dataset. We used Procrustes matching for articulation adaptation and voice conversion for voice adaptation. The performance of the ATS models was measured objectively by mel-cepstral distortion (MCD). Synthetic speech samples are provided in the supplementary material. The results showed that both Procrustes matching and voice conversion improved speaker-independent ATS. With the direct inclusion of target speaker data in the training process, speaker-adaptive ATS achieved performance comparable to that of speaker-dependent ATS. To our knowledge, this is the first study to demonstrate that speaker-adaptive ATS can achieve performance that is not statistically different from that of speaker-dependent ATS.
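As an illustration of the two named techniques, the following is a minimal sketch of generic Procrustes matching and of a commonly used MCD formula. It is not the authors' implementation: the abstract does not specify their exact normalization steps or MCD settings, so the function names, the choice of a reference shape, the unit-norm scaling, and the exclusion of the 0th cepstral coefficient are all assumptions made here for illustration.

```python
import numpy as np


def procrustes_align(shape, reference):
    """Align one (n_points, n_dims) shape to a reference shape.

    Hypothetical helper: generic ordinary Procrustes analysis
    (translation, scaling, rotation), not the paper's exact procedure.
    """
    # Remove translation by centering both shapes on their centroids.
    x = shape - shape.mean(axis=0)
    y = reference - reference.mean(axis=0)
    # Remove scale by normalizing to unit Frobenius norm.
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    # Orthogonal Procrustes: find the orthogonal R minimizing ||x R - y||.
    # Note that this solution may include a reflection.
    u, _, vt = np.linalg.svd(x.T @ y)
    rotation = u @ vt
    return x @ rotation


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB between two time-aligned (n_frames, n_coeffs) arrays.

    Assumes both inputs already have the same number of frames and that
    column 0 holds the energy term, which is excluded (a common convention).
    """
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

In this sketch, each speaker's articulatory shape (for example, sensor positions stacked as rows) would be aligned to a common reference before training, and MCD would be computed between reference and synthesized mel-cepstra after any required time alignment.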

Speaker Adaptation on Articulation and Acoustics for Articulation-to-Speech Synthesis
