Abstract: Silent speech interfaces (SSIs) have emerged as innovative non-acoustic communication methods, and our previous study demonstrated the significant potential of three-axis accelerometer-based SSIs to identify silently spoken words with high classification accuracy. The developed accelerometer-based SSI, with only four accelerometers and a small training dataset, outperformed a conventional surface electromyography (sEMG)-based SSI. In this study, motivated by these promising initial results, we investigated the feasibility of synthesizing spoken speech from three-axis accelerometer signals, with the aim of assessing the potential of accelerometer-based SSIs for practical silent communication applications. Nineteen healthy individuals participated in our experiments. Five accelerometers were attached to the face to acquire speech-related facial movements while the participants read 270 Korean sentences aloud. For the speech synthesis, we used a convolution-augmented Transformer (Conformer)-based deep neural network model to convert the accelerometer signals into a Mel spectrogram, from which an audio waveform was synthesized using HiFi-GAN. To evaluate the quality of the generated Mel spectrograms, ten-fold cross-validation was performed, with the Mel cepstral distortion (MCD) as the evaluation metric. An average MCD of 5.03 ± 0.65 was achieved using the four accelerometers at the positions optimized in our previous study. Furthermore, the quality of the generated Mel spectrograms was significantly enhanced by adding one more accelerometer attached under the chin, achieving an average MCD of 4.86 ± 0.65 (p < 0.001, Wilcoxon signed-rank test). Although an objective comparison is difficult, these results surpass those obtained using conventional SSIs based on sEMG, electromagnetic articulography, and electropalatography, while using the fewest sensors and a similar or smaller number of sentences to train the model. Our proposed approach will contribute to the widespread adoption of accelerometer-based SSIs, leveraging the advantages of accelerometers such as low power consumption, invulnerability to physiological artifacts, and high portability.
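For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how MCD is commonly computed between a reference and a synthesized utterance. It is not the authors' implementation: the abstract does not specify the MCD variant, so the function name `mel_cepstral_distortion`, the exclusion of the 0th (energy) coefficient, and the assumption that both inputs are already time-aligned Mel cepstral coefficient arrays of equal shape are all illustrative choices.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Frame-averaged MCD between two (frames x coefficients) arrays.

    Assumptions (not taken from the paper): the inputs are time-aligned
    Mel cepstral coefficients, and column 0 (energy) is excluded.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    syn = np.asarray(syn_mcep, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("inputs must be time-aligned and have the same shape")

    # Drop the 0th coefficient, then take the per-frame squared difference.
    diff = ref[:, 1:] - syn[:, 1:]
    sq_dist = np.sum(diff ** 2, axis=1)

    # Common definition: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) per frame.
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * sq_dist)
    return float(per_frame.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(100, 25))  # hypothetical 100 frames, 25 coefficients
    synthesized = reference + 0.1 * rng.normal(size=reference.shape)
    print(mel_cepstral_distortion(reference, synthesized))
```

Lower values indicate that the synthesized spectrum is closer to the reference; identical inputs give an MCD of 0.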

SVoice: Enabling Voice Communication in Silence Via Acoustic Sensing on Commodity Devices.

UltraSR: Silent Speech Reconstruction Via Acoustic Sensing

Hybrid Silent Speech Interface Through Fusion of Electroencephalography and Electromyography

Toward Pitch-Insensitive Speaker Verification Via Soundfield

Decoding Silent Speech Commands from Articulatory Movements Through Soft Magnetic Skin and Machine Learning

SilentTalk: Lip Reading Through Ultrasonic Sensing on Mobile Phones

Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation

SuperVoice: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech

SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks

Wavoice: An mmWave-Assisted Noise-Resistant Speech Recognition System

All-weather, natural silent speech recognition via machine-learning-assisted tattoo-like electronics

USpeech: Ultrasound-Enhanced Speech with Minimal Human Effort via Cross-Modal Synthesis

Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training

Silent Speech Eyewear Interface: Silent Speech Recognition Method Using Eyewear and an Ear-Mounted Microphone with Infrared Distance Sensors

Sequence-to-Sequence Voice Reconstruction for Silent Speech in a Tonal Language

Speech synthesis from three-axis accelerometer signals using conformer-based deep neural network

Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images

EarSSR: Silent Speech Recognition via Earphones