Abstract: Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can be used as a form of semi-silent speech interaction in public places without being audible to others. Converting whispers to normal speech also improves speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques either do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning. WESPER consists of a speech-to-unit (STU) encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike existing methods, this conversion is user-independent and does not require a paired dataset of whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from speech units, and it requires only unlabeled speech data from the target speaker. We confirmed that the quality of the speech converted from a whisper was improved while preserving its natural prosody. Additionally, we confirmed the effectiveness of the proposed approach for speech reconstruction for people with speech or hearing disabilities. (Project page: http://lab.rekimoto.org/projects/wesper)

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions