Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition

Niko Moritz, Ruiming Xie, Yashesh Gaur, Ke Li, Simone Merello, Zeeshan Ahmed, Frank Seide, Christian Fuegen
2024-12-20
Abstract: We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart-glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training of a transducer-based streaming machine translation (MT) model for the first time and applying it for parameter initialization of JSTAR. We demonstrate superior performances of JSTAR compared to a strong cascaded ST model in both BLEU scores and latency.
Audio and Speech Processing, Computation and Language
What problem does this paper attempt to address?
This paper addresses the challenges of performing automatic speech recognition (ASR) and speech translation (ST) simultaneously. Traditional cascaded systems run ASR and machine translation (MT) as separate steps, which leads to the following problems:

1. **Increased latency**: In streaming applications, the complex beam-search algorithm increases latency.
2. **Error propagation**: Because ASR and MT are processed separately, ASR errors can propagate to the translation stage and degrade the final translation quality.
3. **Low efficiency**: It is difficult for traditional methods to make full use of large-scale text datasets to train MT models.

To solve these problems, the authors propose the joint speech translation and recognition (JSTAR) model, which has the following features:

- **Fast-slow cascaded encoder architecture**: Fast and slow encoders are combined to balance low latency against high accuracy. The fast encoder serves the ASR task, and the slow encoder serves the ST task, where it provides broader context.
- **Multi-objective training strategy**: The ASR and ST objectives are optimized simultaneously during training, enabling JSTAR to produce high-quality ASR and ST results in a streaming setting.
- **Multi-channel directional ASR solution**: The model targets bilingual conversation scenarios on smart glasses and is trained to distinguish speech from different directions, corresponding to the wearer and the conversation partner respectively.

In addition, the authors study different pre-training strategies, including training a transducer-based (RNN-T) streaming MT model for the first time and using it to initialize the parameters of JSTAR to further improve performance. With these improvements, JSTAR outperforms a strong cascaded ST model in both BLEU score and latency.
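The fast-slow encoder split and the multi-objective loss described above can be illustrated with a toy Python sketch. This is not the authors' implementation: the moving-average "encoders", the context window sizes, the function names, and the loss weighting are all assumptions chosen only to show how a low-latency branch (no lookahead) can feed a higher-context branch (with lookahead), and how two task losses might be combined into one training objective.

```python
from typing import List


def moving_context(frames: List[float], left: int, right: int) -> List[float]:
    """Toy stand-in for an encoder layer: average each frame with
    `left` past frames and `right` future (lookahead) frames."""
    n = len(frames)
    out = []
    for t in range(n):
        window = frames[max(0, t - left):min(n, t + right + 1)]
        out.append(sum(window) / len(window))
    return out


def fast_encoder(frames: List[float]) -> List[float]:
    # Assumption: the fast branch uses no lookahead, so each output
    # is available as soon as its input frame arrives (feeds ASR).
    return moving_context(frames, left=2, right=0)


def slow_encoder(fast_out: List[float]) -> List[float]:
    # Assumption: the slow branch is cascaded on top of the fast output
    # and waits for 3 future frames to gain broader context (feeds ST).
    return moving_context(fast_out, left=2, right=3)


def joint_loss(asr_loss: float, st_loss: float, asr_weight: float = 0.5) -> float:
    # Hypothetical weighting: a convex combination of the two task losses.
    return asr_weight * asr_loss + (1.0 - asr_weight) * st_loss


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0]
    fast = fast_encoder(frames)
    slow = slow_encoder(fast)
    print("fast:", fast)
    print("slow:", slow)
    print("loss:", joint_loss(asr_loss=2.0, st_loss=4.0))
```

In this sketch, changing the last input frame leaves the earlier fast outputs untouched but alters the earlier slow outputs, which mirrors the latency and context trade-off between the two branches.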