Abstract: We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time.

What problem does this paper attempt to address?

The main problem this paper addresses is efficient multilingual speech-to-speech translation (S2ST) that preserves the voice characteristics of the input speaker, that is, zero-shot voice cloning. Specifically, the paper proposes a system named DiffuseST, which translates directly from multiple source languages into English while preserving characteristics such as the speaker's timbre, emotion, and intonation.

### Main Problems and Solutions

1. **Low Latency and Efficiency**:
   - Existing S2ST systems usually rely on cascaded models (ASR + MT + TTS), which are slow and cannot fully use the non-textual information in the audio.
   - DiffuseST introduces a diffusion-based synthesizer that improves audio quality and speaker similarity while reducing inference latency, so the entire model runs more than 5 times faster than real time.
2. **Zero-Shot Voice Cloning**:
   - Many existing S2ST systems do not preserve the input speaker's characteristics in the output speech.
   - By pre-training the diffusion synthesizer on diverse voice data, DiffuseST achieves zero-shot voice cloning with only a small amount of parallel data; that is, it preserves the voice characteristics of a speaker it has never seen.
3. **Parameter Efficiency and Scalability**:
   - Although DiffuseST's diffusion synthesizer has more than twice as many parameters as the Tacotron-based NAT synthesizer, its inference is faster, which opens the way to future streaming processing and larger-scale applications.
   - The model is trained only on public data, which supports reproducibility and transparency and reduces the dependence on private data.

### Experimental Results

- **Translation Quality**: The BLEU score drops slightly (0.25 points), but the difference is not statistically significant (p = 0.35). Given the gains in audio quality and speaker similarity, this small loss is acceptable.
- **Audio Quality**: The diffusion synthesizer improves MOS and PESQ scores by approximately 23% each.
- **Speaker Similarity**: The diffusion synthesizer increases speaker cosine similarity by 4.6% across all languages, indicating better voice cloning.

### Conclusion

By introducing the diffusion synthesizer, DiffuseST addresses the latency, voice-cloning, and parameter-efficiency limitations of existing S2ST systems, paving the way for future streaming and higher-quality S2ST systems.
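
Two of the quantities quoted above, speaker cosine similarity and "faster than real time", follow from simple formulas. The following is a minimal Python sketch of how such numbers are commonly computed, not the paper's evaluation code. The function names and the toy inputs are hypothetical, and the speaker-embedding model that would produce the vectors is assumed and not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length speaker-embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def speedup_over_real_time(audio_seconds, processing_seconds):
    """Seconds of audio handled per second of compute.

    A value above 1 means faster than real time; a value above 5
    corresponds to "more than 5 times faster than real time".
    """
    if audio_seconds < 0 or processing_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Toy embeddings standing in for the source and translated utterances
    # (illustrative values only, not taken from the paper).
    source_embedding = [0.2, 0.7, 0.1]
    output_embedding = [0.25, 0.65, 0.05]
    print(cosine_similarity(source_embedding, output_embedding))

    # Hypothetical timing: 10 s of input audio translated in 2 s of compute.
    print(speedup_over_real_time(10.0, 2.0))
```

A higher cosine similarity between the input and output embeddings indicates that the synthesized English speech sounds more like the original speaker. The speedup is the inverse of the real-time factor that latency reports often use.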

Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation
