Abstract: In this work, we develop SimulSpeech, an end-to-end simultaneous speech-to-text translation system that translates speech in a source language into text in a target language concurrently. SimulSpeech consists of a speech encoder, a speech segmenter, and a text decoder, where 1) the segmenter builds upon the encoder and leverages a connectionist temporal classification (CTC) loss to split the input streaming speech in real time, and 2) the encoder-decoder attention adopts a wait-k strategy for simultaneous translation. Building SimulSpeech is more challenging than building previous cascaded systems, which chain simultaneous automatic speech recognition (ASR) and simultaneous neural machine translation (NMT). We introduce two novel knowledge distillation methods to ensure its performance: 1) attention-level knowledge distillation transfers knowledge from the product of the attention matrices of the simultaneous NMT and ASR models to help train the attention mechanism in SimulSpeech; 2) data-level knowledge distillation transfers knowledge from the full-sentence NMT model and also reduces the complexity of the data distribution to ease the optimization of SimulSpeech. Experiments on the MuST-C English-Spanish and English-German spoken language translation datasets show that SimulSpeech achieves reasonable BLEU scores and lower delay compared to full-sentence end-to-end speech-to-text translation (without simultaneous translation), and better performance than the two-stage cascaded simultaneous translation model in terms of both BLEU score and translation delay.
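
The two mechanisms named in the abstract, the wait-k strategy and attention-level knowledge distillation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the wait-k policy is applied over speech segments, that attention matrices are row-normalized, and that the student is trained with a cross-entropy loss against the teacher attention; the abstract does not state the loss, and all function names here are hypothetical.

```python
import numpy as np


def wait_k_mask(num_target, num_source, k):
    """Boolean mask: entry [i, j] is True when target step i (0-based)
    may attend to source segment j. Under wait-k, step i sees the
    first i + k source segments (assumed segment-level policy)."""
    steps = np.arange(num_target)[:, None]
    segments = np.arange(num_source)[None, :]
    return segments < steps + k


def attention_teacher(nmt_attn, asr_attn):
    """Combine teacher attentions by matrix multiplication.

    nmt_attn: (target_tokens, source_tokens) from the NMT teacher.
    asr_attn: (source_tokens, speech_frames) from the ASR teacher.
    Returns:  (target_tokens, speech_frames), which maps each target
    token directly onto speech frames.
    """
    return nmt_attn @ asr_attn


def attention_distillation_loss(student_attn, teacher_attn, eps=1e-9):
    """Row-wise cross-entropy between teacher and student attention.
    The choice of cross-entropy is an assumption made for illustration."""
    per_row = -(teacher_attn * np.log(student_attn + eps)).sum(axis=1)
    return float(per_row.mean())


if __name__ == "__main__":
    nmt = np.array([[0.5, 0.5], [0.0, 1.0]])            # 2 target x 2 source tokens
    asr = np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]])  # 2 source tokens x 3 frames
    teacher = attention_teacher(nmt, asr)
    print(teacher)
    print(wait_k_mask(num_target=3, num_source=4, k=2))
```

Because each row of both teacher matrices sums to one, each row of their product also sums to one, so the product can serve directly as a target distribution over speech frames for the end-to-end model's attention.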

Improving Speech Transcription For Mandarin-English Translation

From Speech to Text in Chinese: A Computer-Aided Transcription System for the Legal Domain.

The BBN Mandarin Broadcast News Transcription System

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding.

Knowledge-based Linguistic Encoding for End-to-End Mandarin Text-to-Speech Synthesis

Rethinking and Improving Multi-task Learning for End-to-end Speech Translation

Advancing Speech Translation: A Corpus of Mandarin-English Conversational Telephone Speech

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Statistically-based Model for Computer-Aided Transcription Application

Court Stenography-To-Text ("STT") in Hong Kong: A Jurilinguistic Engineering Effort

Jurilinguistic engineering in Cantonese Chinese: an N-gram-based speech to text transcription system

Automatic Spelling Correction with Transformer for CTC-based End-to-End Speech Recognition

An Unified and Automatic Approach of Mandarin HTS System.

Investigation of Transformer Based Spelling Correction Model for CTC-based End-to-End Mandarin Speech Recognition

A Preliminary Study on Deep Learning-based Chinese Text to Taiwanese Speech Synthesis System

Back Translation for Speech-to-text Translation Without Transcripts

Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought

Advances in Syntax-Based Malay-English Speech Translation

SimulSpeech: End-to-End Simultaneous Speech to Text Translation.