Abstract: Simultaneous speech-to-speech translation aims to interpret concurrently with speech in the source language, which is of great importance for real-time understanding of spoken lectures or conversations. Previous methods usually divide the problem into three stages: simultaneous automatic speech recognition (ASR), simultaneous neural machine translation (NMT), and simultaneous text-to-speech (TTS). Such a cascade is not end-to-end and suffers from translation delay and error propagation. In this work, we propose SimulS2S, an end-to-end simultaneous speech-to-speech translation system that translates source-language speech directly into target-language speech concurrently, jointly optimizing speech recognition, text translation, and speech synthesis in one sequence-to-sequence model. SimulS2S consists of a speech encoder and a speech decoder, both with a speech segmenter and a wait-k strategy for simultaneous translation. Since simultaneous speech-to-speech translation is challenging, we propose several key techniques to help train SimulS2S: 1) a curriculum learning mechanism that trains the model gradually from full-sentence translation to simultaneous translation; 2) two auxiliary tasks, ASR and S2T (speech-to-text translation), that share the same encoder with the SimulS2S model to help train the encoder; 3) knowledge distillation to transfer knowledge from the cascaded NMT and TTS models to the SimulS2S model. Experiments on Fisher Spanish-English conversation translation datasets demonstrate that SimulS2S 1) achieves low translation delay and reasonable translation quality compared with full …
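
The wait strategy and the curriculum from full-sentence to simultaneous translation can be illustrated with a minimal sketch. It assumes the standard wait-k formulation from the simultaneous translation literature, in which the decoder emits target step t after reading min(k + t - 1, S) of S source segments. The abstract does not give the exact schedule, so the function names `wait_k_schedule` and `curriculum_k` and the linear decay of k are hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List


def wait_k_schedule(num_source_segments: int, num_target_steps: int, k: int) -> List[int]:
    """For each target step t (1-indexed), return how many source segments
    have been read before that step is emitted under a wait-k policy."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Wait for k segments first, then read one more segment per emitted step,
    # capped at the total number of source segments.
    return [
        min(k + t - 1, num_source_segments)
        for t in range(1, num_target_steps + 1)
    ]


def curriculum_k(epoch: int, start_k: int, target_k: int, epochs_per_step: int = 1) -> int:
    """Hypothetical curriculum: start with a large k (close to full-sentence
    translation) and lower it by one every `epochs_per_step` epochs until
    the target k is reached. The linear decay is an assumption."""
    if epochs_per_step < 1:
        raise ValueError("epochs_per_step must be at least 1")
    return max(target_k, start_k - epoch // epochs_per_step)


if __name__ == "__main__":
    # With 5 source segments and k = 3, the decoder sees the full source
    # from the third target step onward.
    print(wait_k_schedule(num_source_segments=5, num_target_steps=6, k=3))
    # k shrinks from 10 toward 3 as training progresses.
    print([curriculum_k(epoch, start_k=10, target_k=3) for epoch in range(0, 10, 3)])
```

A large k makes the schedule equivalent to reading the whole utterance before emitting anything, which is why decreasing k is one plausible way to move gradually from full-sentence to simultaneous translation.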

Curriculum Pre-training for End-to-End Speech Translation

SimulS2S: End-to-End Simultaneous Speech to Speech Translation

Bridging the Gap Between Pre-Training and Fine-Tuning for End-to-End Speech Translation

Pre-Trained Acoustic-and-Textual Modeling for End-To-End Speech-To-Text Translation

Curriculum pre-training for stylized neural machine translation

Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data

End-to-End Tibetan-Chinese Speech Translation Based on Multi-task and Multi-level Pre-training

Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders

Pre-training for Speech Translation: CTC Meets Optimal Transport

Unified Speech-Text Pre-training for Speech Translation and Recognition

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Efficient Speech Translation with Pre-trained Models

Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

End-to-End Speech Translation with Knowledge Distillation

The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation

Denoising Pre-training for Machine Translation Quality Estimation with Curriculum Learning

Self-Training for End-to-End Speech Translation

Joint Training and Decoding for Multilingual End-to-End Simultaneous Speech Translation

Language Model Pre-training with Linguistically Motivated Curriculum Learning

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining

Structured-based Curriculum Learning for End-to-end English-Japanese Speech Translation