Abstract: Silent speech interfaces (SSIs) convert non-audio bio-signals, such as articulatory movement, to speech. This technology has the potential to restore the speech ability of individuals who have lost their voice but can still articulate (e.g., laryngectomees). Articulation-to-speech (ATS) synthesis is an SSI algorithm design that offers easy implementation and low latency, and is therefore becoming more popular. Current ATS studies focus on speaker-dependent (SD) models to avoid the large variations in articulatory patterns and acoustic features across speakers. However, these designs are limited by the small amount of data available from individual speakers. Speaker adaptation designs that include data from multiple speakers have the potential to address this limitation; however, few prior studies have investigated their performance in ATS. In this paper, we investigated speaker adaptation on both the input articulation and the output acoustic signals (with or without direct inclusion of data from test speakers) using a publicly available electromagnetic articulography (EMA) dataset. We used Procrustes matching and voice conversion for articulation and voice adaptation, respectively. The performance of the ATS models was measured objectively by mel-cepstral distortion (MCD). Synthetic speech samples were generated and are provided in the supplementary material. The results showed that both Procrustes matching and voice conversion improved speaker-independent ATS. With the direct inclusion of target speaker data in the training process, the speaker-adaptive ATS achieved performance comparable to that of speaker-dependent ATS. To our knowledge, this is the first study to demonstrate that speaker-adaptive ATS can achieve performance that is not statistically different from that of speaker-dependent ATS.
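As a rough illustration of the two techniques named in the abstract, the sketch below shows a generic form of Procrustes matching (removing translation, scale, and rotation between two point sets) and a common definition of mel-cepstral distortion. It is not the authors' implementation: the function names `procrustes_match` and `mel_cepstral_distortion`, the array layouts, and the choice to exclude the 0th cepstral coefficient are assumptions made for this example, and the rotation step as written also permits reflections.

```python
import numpy as np


def procrustes_match(shape, reference):
    """Align `shape` to `reference` (both N x D arrays of matching points).

    Hypothetical helper: removes translation and scale from both point
    sets, then rotates `shape` onto `reference`. Returns the aligned
    shape and the normalized reference.
    """
    # Remove translation by centering each point set on its centroid.
    x = shape - shape.mean(axis=0)
    y = reference - reference.mean(axis=0)
    # Remove scale by normalizing to unit Frobenius norm.
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    # Orthogonal Procrustes solution: R = U V^T from the SVD of x^T y.
    u, _, vt = np.linalg.svd(x.T @ y)
    rotation = u @ vt
    return x @ rotation, y


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB between two time-aligned (frames x coeffs) arrays.

    Assumption: column 0 holds the energy-like 0th coefficient, which is
    excluded here, as is common practice.
    """
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

In this sketch, both inputs to `mel_cepstral_distortion` are assumed to be already time-aligned frame by frame; lower values indicate synthetic speech that is closer to the reference.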

The Secret Source: Incorporating Source Features to Improve Acoustic-to-Articulatory Speech Inversion

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy

Two-Stream Joint-Training for Speaker Independent Acoustic-to-Articulatory Inversion

Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator

Accent Conversion with Articulatory Representations

Speaker-independent speech inversion for recovery of velopharyngeal port constriction degree

Cepstral Smoothing of Spectral Masks for Acoustic Vector-Sensor Based Convolutive Speech Separation

Research on the Distal Supervised Learning Model of Speech Inversion

Acoustic-to-articulatory inversion for dysarthric speech: Are pre-trained self-supervised representations favorable?

Speaker-independent Speech Inversion for Estimation of Nasalance

Spectral-change Enhancement with Prior SNR for the Hearing Impaired

Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR Through Efficient Joint Optimization

Comprehensive Source-Target Speaker Voice Conversion Analysis

Speaker Adaptation on Articulation and Acoustics for Articulation-to-Speech Synthesis

Estimate Articulatory MRI Series From Acoustic Signal Using Deep Architecture

SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion

Three Dimensional Acoustic Contrast Source Inversion Method

Improving Separation of Harmonic Sources with Iterative Estimation of Spatial Cues

Combined Articulatory and Auditory Processing for Improved Speech Recognition