Abstract: Direct speech-to-speech translation (S2ST) with discrete units leverages recent progress in speech representation learning. Specifically, a sequence of discrete representations derived in a self-supervised manner is predicted by the model and passed to a vocoder for speech reconstruction. This approach still faces the following challenges: 1) acoustic multimodality: the discrete units derived from speech with the same content can be nondeterministic due to acoustic properties (e.g., rhythm, pitch, and energy), which degrades translation accuracy; 2) high latency: current S2ST systems use autoregressive models that predict each unit conditioned on the previously generated sequence, failing to take full advantage of parallelism. In this work, we propose TranSpeech, a speech-to-speech translation model with bilateral perturbation. To alleviate the acoustic multimodality problem, we propose bilateral perturbation (BiP), which consists of style normalization and information enhancement stages, to learn only the linguistic information from speech samples and generate more deterministic representations. With reduced multimodality, we take a step forward and become the first to establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices and produces high-accuracy results in just a few cycles. Experimental results on three language pairs demonstrate that BiP yields an improvement of 2.9 BLEU on average compared with a baseline textless S2ST model. Moreover, our parallel decoding significantly reduces inference latency, achieving a speedup of up to 21.4x over the autoregressive technique. Audio samples are available at https://TranSpeech.github.io/
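
The "repeatedly masks and predicts" decoding described in the abstract can be illustrated with a minimal sketch of a generic mask-predict loop. This is not the authors' implementation: the linear masking schedule, the `mask_predict` function, and the `toy_predict` stand-in for a trained unit predictor are all assumptions made for illustration. The sketch shows the general idea that every position is predicted in parallel on each pass, and that the least confident predictions are re-masked and predicted again.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder marking a masked unit position

# Hypothetical predictor interface: given the partially masked sequence,
# return a (unit, confidence) pair for every position in one parallel pass.
Predictor = Callable[[List[Optional[int]]], List[Tuple[int, float]]]


def mask_predict(predict: Predictor, length: int, iterations: int) -> List[Optional[int]]:
    """Iteratively fill masked positions and re-mask the least confident ones."""
    units: List[Optional[int]] = [MASK] * length
    probs = [0.0] * length
    for t in range(iterations):
        # Predict all positions at once; only masked positions are updated.
        for i, (unit, prob) in enumerate(predict(units)):
            if units[i] is MASK:
                units[i], probs[i] = unit, prob
        # Assumed schedule: the number of re-masked units decays linearly to zero.
        n_mask = length * (iterations - t - 1) // iterations
        if n_mask == 0:
            break
        lowest = sorted(range(length), key=lambda i: probs[i])[:n_mask]
        for i in lowest:
            units[i] = MASK
    return units


# Toy stand-in for a trained model (purely illustrative).
TARGET = [3, 1, 4, 1, 5, 9]


def toy_predict(units: List[Optional[int]]) -> List[Tuple[int, float]]:
    visible = sum(u is not MASK for u in units)
    out = []
    for i in range(len(units)):
        if visible == 0 and i % 2 == 1:
            out.append((0, 0.3))  # wrong, low-confidence guess without context
        else:
            out.append((TARGET[i], 0.6 + 0.05 * visible))
    return out


if __name__ == "__main__":
    print(mask_predict(toy_predict, len(TARGET), iterations=1))
    print(mask_predict(toy_predict, len(TARGET), iterations=3))
```

In this toy setup a single pass leaves the low-confidence positions wrong, while three passes recover the target sequence, which mirrors the claim that a few refinement cycles suffice. The number of model calls equals the number of iterations rather than the sequence length, which is the source of the latency advantage over autoregressive decoding.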

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

SimulS2S: End-to-End Simultaneous Speech to Speech Translation

Divergence-Guided Simultaneous Speech Translation

ESPnet2-TTS: Extending the Edge of TTS Research

Deep Learning Based TTS-STT Model with Transliteration for Indic Languages

NeurST: Neural Speech Translation Toolkit

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Learning When to Speak: Latency and Quality Trade-offs for Simultaneous Speech-to-Speech Translation with Offline Models

ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

Towards Real-World Streaming Speech Translation for Code-Switched Speech

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

PolyVoice: Language Models for Speech to Speech Translation

Multilingual Speech-to-Speech Translation into Multiple Target Languages

End-to-End Speech Translation for Code Switched Speech

Towards End-to-end Speech-to-text Translation with Two-pass Decoding

Textless Speech-to-Speech Translation With Limited Parallel Data

Enhancing Speech-to-Speech Translation with Multiple TTS Targets

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation