Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models

Neil Shah, Shirish Karande, Vineet Gandhi
2024-07-26
Abstract: We propose a novel approach to significantly improve the intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task, leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning techniques. Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite utilizing simulated speech, our method surpasses the current state-of-the-art (SOTA) by 29.08% improvement in the Mel-Cepstral Distortion (MCD) metric. Additionally, we present error rates and demonstrate our model's proficiency to synthesize speech in novel voices of interest. Moreover, we present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark with a Word Error Rate (WER) of 42.57% to gauge the intelligibility of the synthesized speech. Speech samples can be found at https://nam2speech.github.io/NAM2Speech/
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper addresses is improving intelligibility in Non-Audible Murmur (NAM)-to-speech conversion. Specifically, existing NAM-to-speech methods have the following limitations:

1. **Dependence on recorded ground-truth speech**: Existing methods usually require explicitly recorded, high-quality ground-truth speech, which may be difficult to obtain in practical applications.
2. **Low intelligibility and quality of synthesized speech**: Speech synthesized by current methods is poor in both intelligibility and sound quality.
3. **Limited prediction ability**: Most methods can only predict speech representations based on Mel-spectrogram features, which limits their ability to synthesize speech in new voices.
4. **Small database scale**: Available NAM datasets are small, so the advantages of modern deep-learning techniques cannot be fully exploited.

To solve these problems, the paper proposes a method that uses self-supervision and Sequence-to-Sequence (Seq2Seq) learning to simulate ground-truth speech. With this method, the intelligibility of synthesized speech improves significantly without relying on actual recordings, and performance exceeds that of the existing state-of-the-art methods. The study also introduces data augmentation and Dynamic Time Warping (DTW) to improve the alignment between NAM signals and the simulated speech.

### Specific improvement points

- **No need to explicitly record ground-truth speech**: High-quality ground-truth speech is simulated through self-supervision and speech-to-speech synthesis.
- **Improved intelligibility of synthesized speech**: A new data augmentation technique generates additional NAM samples, and DTW is introduced to improve time alignment.
- **Cross-modal learning framework**: A Seq2Seq learning algorithm performs cross-modal learning in the latent space, enabling the model to transfer speech content to new voices.

Through these improvements, the study not only improves the intelligibility of synthesized speech but also demonstrates the ability to synthesize speech in new voices, providing a new solution for the NAM-to-speech conversion task.
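
The summary mentions Dynamic Time Warping (DTW) for aligning NAM signals with speech, but gives no detail on how it is applied. As a rough illustration only, the sketch below implements textbook DTW over two generic feature sequences using NumPy. The function name `dtw_align`, the Euclidean frame distance, and the toy inputs are assumptions made for this example; it is not the authors' implementation, and real use would operate on acoustic features rather than the one-dimensional toy values shown.

```python
import numpy as np


def dtw_align(source, target):
    """Align two feature sequences with classic dynamic time warping.

    source: array-like of shape (n, d); target: array-like of shape (m, d).
    Returns (total_cost, path), where path is a list of (i, j) frame pairs.
    Hypothetical helper for illustration, not taken from the paper.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    n, m = len(source), len(target)

    # Pairwise Euclidean distance between every source and target frame.
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)

    # acc[i, j] = cheapest cost of aligning the first i source frames
    # with the first j target frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both sequences
                acc[i - 1, j],      # advance source only
                acc[i, j - 1],      # advance target only
            )

    # Backtrack from the final cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = (acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
        step = int(np.argmin(moves))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(acc[n, m]), path


# Toy example: the target repeats the middle frame of the source.
nam_like = [[0.0], [1.0], [2.0]]
speech_like = [[0.0], [1.0], [1.0], [2.0]]
cost, path = dtw_align(nam_like, speech_like)
```

In this toy case the middle source frame is matched to both repeated target frames, which is the kind of local time stretching that DTW is meant to absorb when two recordings of the same content differ in duration.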