Abstract: Silent Speech Interfaces (SSIs) have been developed to convert silent articulatory gestures into speech, facilitating silent speech in public spaces and aiding individuals with aphasia. Prior SSIs, which rely on either wearable devices or cameras, can require prolonged physical contact or risk privacy leakage. Recent advances in acoustic sensing offer new opportunities for gesture sensing. However, these approaches typically focus on content classification rather than on reconstructing audible speech, which loses crucial speech characteristics such as speech rate, intonation, and emotion. In this paper, we propose UltraSR, a novel sensing system that supports accurate audible speech reconstruction by analyzing how tiny articulatory gestures disturb the reflected ultrasound signal. The design of UltraSR introduces a multi-scale feature extraction scheme that aggregates information from multiple views, and a new model that establishes a unique mapping between ultrasound and speech signals, so that audible speech can be reconstructed from silent speech. Establishing this mapping, however, requires a large amount of training data. Rather than collecting massive amounts of data, which is time-consuming, we construct an inverse task that forms a dual of the original task and generates virtual gestures from widely available audio (e.g., phone calls) to facilitate model training. Furthermore, we introduce a fine-tuning mechanism that uses unlabeled data for user adaptation. We implement UltraSR using a portable smartphone and evaluate it in various environments. The evaluation results show that UltraSR can reconstruct speech with a Character Error Rate (CER) as low as 5.22%, and can decrease the CER from 80.13% to 6.31% on new users given only 1 hour of ultrasound signals, outperforming state-of-the-art acoustic-based approaches while preserving rich speech information.
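For readers unfamiliar with the metric, the following is a minimal sketch of how a Character Error Rate is conventionally computed: the character-level edit (Levenshtein) distance between a reference transcript and a hypothesis transcript, divided by the reference length. The abstract does not state how UltraSR obtains its transcripts or normalizes text, so the function names `edit_distance` and `cer` and the example strings here are illustrative assumptions, not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    reference = "turn on the light"
    hypothesis = "turn on the night"
    print(f"CER = {cer(reference, hypothesis):.2%}")
```

Under this definition, a CER of 5.22% corresponds to roughly one character error per 19 reference characters. Note that CER can exceed 100% when the hypothesis contains many insertions.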

Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation

Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

UltraSR: Silent Speech Reconstruction Via Acoustic Sensing

SVoice: Enabling Voice Communication in Silence Via Acoustic Sensing on Commodity Devices

TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis

Creating Personalized Synthetic Voices from Articulation Impaired Speech Using Augmented Reconstruction Loss

Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images

Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models

An Audio-textual Diffusion Model For Converting Speech Signals Into Ultrasound Tongue Imaging Data

Decoding Silent Speech Commands from Articulatory Movements Through Soft Magnetic Skin and Machine Learning

Sequence-to-Sequence Voice Reconstruction for Silent Speech in a Tonal Language

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

Silenttalk: Lip Reading Through Ultrasonic Sensing on Mobile Phones

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

Silent versus modal multi-speaker speech recognition from ultrasound and video

Articulatory-to-acoustic Conversion Using BLSTM-RNNs with Augmented Input Representation

LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading