Abstract: We propose a novel approach to significantly improve the intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task, leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning techniques. Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite utilizing simulated speech, our method surpasses the current state-of-the-art (SOTA) with a 29.08% improvement in the Mel-Cepstral Distortion (MCD) metric. Additionally, we present error rates and demonstrate our model's proficiency in synthesizing speech in novel voices of interest. Moreover, we present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark with a Word Error Rate (WER) of 42.57% to gauge the intelligibility of the synthesized speech. Speech samples can be found at https://nam2speech.github.io/NAM2Speech/
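
The abstract reports results in two metrics, MCD and WER. As a rough illustration of what these numbers measure, here is a minimal sketch using the commonly cited definitions. The function names `mel_cepstral_distortion` and `word_error_rate` are hypothetical, and the conventions chosen here (dropping the 0th cepstral coefficient, assuming the two sequences are already time-aligned, splitting words on whitespace) are assumptions; the paper's exact evaluation setup is not stated in this summary.

```python
import math
import numpy as np


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    Both inputs have shape (frames, coefficients). Coefficient 0 (energy)
    is dropped, which is a common convention but an assumption here.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("sequences must be time-aligned and equally shaped")
    diff = ref[:, 1:] - syn[:, 1:]
    # Per-frame distortion: sqrt(2 * sum of squared coefficient differences)
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / math.log(10.0)) * np.mean(per_frame))


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref_words)
```

Lower is better for both metrics. In practice the hypothesis text for WER would come from running a speech recognizer on the synthesized audio, which is outside the scope of this sketch.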

What problem does this paper attempt to address?

The main problem this paper addresses is improving intelligibility in the task of converting Non-Audible Murmur (NAM) to speech. Specifically, existing NAM-to-speech methods have the following limitations:

1. **Dependence on recorded ground-truth speech**: Existing methods usually require explicitly recorded, high-quality ground-truth speech, which may be difficult to obtain in practical applications.
2. **Low intelligibility and quality of synthesized speech**: The speech generated by current methods performs poorly in terms of intelligibility and sound quality.
3. **Limited prediction ability**: Most methods can only predict speech representations based on Mel-spectrogram features, which limits their ability to synthesize speech in new voices.
4. **Small database scale**: Because the available NAM datasets are small, the advantages of modern deep-learning techniques cannot be fully exploited.

To address these problems, the paper proposes a new method that uses self-supervision and sequence-to-sequence (Seq2Seq) learning techniques to simulate ground-truth speech. With this method, the authors significantly improve the intelligibility of synthesized speech without relying on recorded ground-truth speech and outperform the existing state-of-the-art methods. In addition, the study introduces a data augmentation technique and uses Dynamic Time Warping (DTW) to optimize the alignment between NAM signals and synthesized speech.

### Specific improvement points

- **No need to explicitly record ground-truth speech**: High-quality ground-truth speech is simulated through self-supervision and speech-to-speech synthesis techniques.
- **Improved intelligibility of synthesized speech**: A new data augmentation technique is proposed to generate more NAM samples, and DTW is introduced to optimize time alignment.
- **Cross-modal learning framework**: A Seq2Seq learning algorithm performs cross-modal learning in the latent space, enabling the model to clone speech content into new voices.

Through these improvements, the study both improves the intelligibility of synthesized speech and demonstrates the ability to synthesize speech in new voices, providing a new solution for the NAM-to-speech conversion task.
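
The summary above mentions Dynamic Time Warping for aligning NAM signals with synthesized speech. The following is a minimal sketch of textbook DTW, not the paper's implementation. It assumes both inputs are per-frame feature sequences and that Euclidean distance is the frame-level cost; the function name `dtw_align` is hypothetical, and the features and distance actually used in the paper are not given in this summary.

```python
import numpy as np


def dtw_align(x, y):
    """Classic DTW between sequences x (n frames) and y (m frames).

    Returns the total alignment cost and the warping path as a list of
    (index_in_x, index_in_y) pairs. Frame cost is Euclidean distance
    (an assumption made for this sketch).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)
    # Pairwise frame distances, shape (n, m)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

    # Accumulated cost with a padded border of infinities
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both sequences
                acc[i - 1, j],      # advance x only
                acc[i, j - 1],      # advance y only
            )

    # Backtrack from the end to recover the warping path
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
        path.append((i - 1, j - 1))
    path.reverse()
    return float(acc[n, m]), path
```

For example, `dtw_align([0, 1, 2, 3], [0, 1, 1, 2, 3])` maps the repeated `1` in the second sequence onto the single `1` in the first, which is the kind of stretching needed when a murmured utterance and its synthesized counterpart differ in duration. This quadratic-time version is meant for illustration only; long utterances typically call for a windowed or library implementation.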

Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models

Audio-Visual Speech Enhancement Using Self-supervised Learning to Improve Speech Intelligibility in Cochlear Implant Simulations

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models.

Can Knowledge of End-to-End Text-to-Speech Models Improve Neural MIDI-to-Audio Synthesis Systems?

SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion

Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision.

Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding

A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis

Improving Sequence-to-sequence Voice Conversion by Adding Text-supervision

Neural Speech Synthesis with Transformer Network.

Iteratively Improving Speech Recognition and Voice Conversion

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech

Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

Self-Supervised Representations for Singing Voice Conversion