Abstract: Silent Speech Interfaces (SSIs) have been proposed as a means of reconstructing audible speech from silent articulatory gestures, enabling covert voice communication in public and voice assistance for people with aphasia. Prior SSI approaches rely on either wearable devices or cameras, which may require extended physical contact or risk privacy leakage. Recent advances in acoustic sensing have brought new opportunities for sensing gestures, but these methods were designed to infer speech content through classification rather than to reconstruct audible speech, so they lose important speech information (e.g., speech rate, intonation, and emotion). In this paper, we propose the first system that supports accurate audible speech reconstruction by analyzing the disturbance that tiny articulatory gestures cause in the reflected ultrasound signal. Its design introduces a new model that provides a unique mapping between ultrasound and speech signals, so that audible speech can be reconstructed from silent speech. Establishing this mapping, however, requires a large amount of training data. Instead of the time-consuming collection of massive amounts of data, we construct an inverse task that forms a dual of the original task and generates virtual gestures from widely available audio (e.g., phone calls) to facilitate model training. Furthermore, we introduce a fine-tuning mechanism that uses unlabeled data for user adaptation. We implement the system on a portable smartphone and evaluate it in various environments. The evaluation results show that it can reconstruct speech with a Character Error Rate (CER) as low as 7.62%, and that it decreases the CER from 82.77% to 9.42% on new users given only 1 hour of ultrasound signals, outperforming state-of-the-art acoustic-based approaches while preserving rich speech information.
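
For readers unfamiliar with the metric quoted in the abstract, CER is conventionally defined as the character-level edit distance between a reference transcript and a hypothesis transcript, divided by the reference length. The following is a minimal sketch of that standard definition only. The function names `edit_distance` and `cer` are illustrative, and the abstract does not state how the reconstructed speech is transcribed before scoring, so this should not be read as the authors' evaluation pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate as a fraction of the reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, chosen only to illustrate the calculation.
    reference = "hello world"
    hypothesis = "hallo world"
    print(f"CER = {cer(reference, hypothesis):.2%}")
```

Under this definition the value is a fraction that is usually reported as a percentage, and it can exceed 100% when the hypothesis is much longer than the reference.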
