Abstract: Simultaneous speech to speech translation aims to interpret concurrently with the source-language speech, which is of great importance for real-time understanding of spoken lectures or conversations. Previous methods usually divide this problem into three stages: simultaneous automatic speech recognition (ASR), simultaneous neural machine translation (NMT), and simultaneous text to speech (TTS). Such a cascade is not end-to-end and suffers from translation delay and error propagation. In this work, we propose SimulS2S, an end-to-end simultaneous speech to speech translation system that translates source-language speech directly into target-language speech concurrently, jointly optimizing speech recognition, text translation and speech synthesis in one sequence-to-sequence model. SimulS2S consists of a speech encoder and a speech decoder, both with a speech segmenter and a wait-k strategy for simultaneous translation. Since simultaneous speech to speech translation is challenging, we propose several key techniques to help the training of SimulS2S: 1) a curriculum learning mechanism that trains the model gradually from full-sentence translation to simultaneous translation; 2) two auxiliary tasks, ASR and S2T (speech to text translation), that share the same encoder with the SimulS2S model to help train the encoder; 3) knowledge distillation to transfer knowledge from the cascaded NMT and TTS models to the SimulS2S model. Experiments on Fisher Spanish-English conversation translation datasets demonstrate that SimulS2S 1) achieves low translation delay and reasonable translation quality compared with full …
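
The wait-k strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard wait-k schedule from the simultaneous translation literature, in which the decoder first waits for k source units and then alternates between reading one unit and writing one output. The abstract does not give SimulS2S's exact formulation, so the function name `wait_k_schedule` and the treatment of speech segments as source units are assumptions made for illustration only.

```python
from typing import List


def wait_k_schedule(num_source_segments: int, num_target_steps: int, k: int) -> List[int]:
    """Number of source segments read before each target step is emitted.

    Assumed standard wait-k rule: before writing target step t (1-indexed),
    the model has read min(k + t - 1, num_source_segments) source segments.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [
        min(k + t - 1, num_source_segments)
        for t in range(1, num_target_steps + 1)
    ]


# Hypothetical example: 5 source segments, 6 target steps.
simultaneous = wait_k_schedule(5, 6, k=2)    # start writing after 2 segments
full_sentence = wait_k_schedule(5, 6, k=5)   # k >= source length: read everything first
print(simultaneous)
print(full_sentence)
```

When k is at least the number of source segments, the schedule degenerates to full-sentence translation, which is one plausible way to view the curriculum from full-sentence to simultaneous translation: training could start with a large k and reduce it over time. Whether SimulS2S schedules its curriculum this way is not stated in the abstract.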

SimulS2S: End-to-End Simultaneous Speech to Speech Translation

SimulSpeech: End-to-End Simultaneous Speech to Text Translation

Divergence-Guided Simultaneous Speech Translation

Recent Advances in End-to-End Simultaneous Speech Translation

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

SimulSLT: End-to-End Simultaneous Sign Language Translation

End-to-End Simultaneous Speech Translation with Differentiable Segmentation

SimulTron: On-Device Simultaneous Speech to Speech Translation

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Tagged End-to-End Simultaneous Speech Translation Training using Simultaneous Interpretation Data

SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation

Fluent and Low-latency Simultaneous Speech-to-Speech Translation with Self-adaptive Training

Better Simultaneous Translation with Monotonic Knowledge Distillation

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

SimulEval: An Evaluation Toolkit for Simultaneous Translation

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

Learning Adaptive Segmentation Policy for End-to-End Simultaneous Translation

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation