Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi DuBois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang, Zoe Abrams, Morgan McGuire
2024-06-15
Abstract: We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.
Machine Learning, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to achieve efficient multilingual speech-to-speech translation (S2ST) while preserving the voice characteristics of the input speaker, that is, zero-shot voice cloning. It proposes a system named DiffuseST, which translates directly from multiple source languages into English and preserves characteristics such as the speaker's timbre, emotion, and intonation during translation.

### Main Problems and Solutions

1. **Low Latency and Efficiency**:
   - Existing S2ST systems usually rely on cascaded models (ASR + MT + TTS), which are slow and cannot fully use the non-textual information in the audio.
   - DiffuseST introduces a diffusion-based synthesizer that improves audio quality and speaker similarity while reducing inference latency, so that the entire model runs more than 5 times faster than real time.
2. **Zero-Shot Voice Cloning**:
   - Many existing S2ST systems do not preserve the input speaker's characteristics well in the output speech.
   - By pre-training the diffusion synthesizer on diverse voice data, DiffuseST achieves zero-shot voice cloning with only a small amount of parallel data: it can preserve a speaker's voice characteristics even if it has never seen that speaker.
3. **Parameter Efficiency and Scalability**:
   - Although DiffuseST's diffusion synthesizer has more than twice as many parameters as the Tacotron-based NAT synthesizer, it has lower inference latency, which opens the way to future streaming and larger-scale applications.
   - The model is trained only on public data, which supports the reproducibility and transparency of the research and reduces dependence on private data.

### Experimental Results

- **Translation Quality**: The BLEU score drops slightly (0.25 points), but the difference is not statistically significant (p = 0.35). Given the gains in audio quality and speaker similarity, this small loss is acceptable.
- **Audio Quality**: The diffusion synthesizer improves MOS and PESQ scores by approximately 23% each, a clear advantage in audio quality.
- **Speaker Similarity**: The diffusion synthesizer improves speaker cosine similarity by 4.6% across all languages, indicating better voice cloning.

### Conclusion

By introducing the diffusion synthesizer, DiffuseST addresses the latency, voice-preservation, and parameter-efficiency limitations of existing S2ST systems, paving the way for future streaming and higher-quality S2ST systems.
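
Two of the reported quantities, speaker cosine similarity and speed relative to real time, have simple definitions that can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' evaluation code: the function names are hypothetical, the embeddings are assumed to come from some speaker encoder that the summary does not specify, and "N times faster than real time" is assumed to mean audio duration divided by processing time.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speedup_over_real_time(audio_seconds, processing_seconds):
    """How many times faster than real time the system runs.

    Assumed definition: duration of the audio divided by the
    wall-clock time needed to process it.
    """
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Hypothetical speaker embeddings for the source and translated speech.
    source_embedding = [0.2, 0.7, 0.1]
    output_embedding = [0.25, 0.65, 0.05]
    print(cosine_similarity(source_embedding, output_embedding))

    # Hypothetical timing: 10 s of audio processed in 2 s.
    print(speedup_over_real_time(10.0, 2.0))
```

Under these assumptions, a system that processes 10 seconds of audio in under 2 seconds meets the "more than 5 times faster than real time" description, and a higher cosine similarity between source and output embeddings indicates closer voice preservation.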