Abstract: We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart-glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training of a transducer-based streaming machine translation (MT) model for the first time and applying it for parameter initialization of JSTAR. We demonstrate superior performances of JSTAR compared to a strong cascaded ST model in both BLEU scores and latency.

What problem does this paper attempt to address?

This paper addresses the challenges of performing automatic speech recognition (ASR) and speech translation (ST) at the same time. Traditional cascaded systems run ASR and machine translation (MT) as separate steps, which leads to the following problems:

1. **Increased latency**: In streaming applications, the complex beam-search algorithm increases latency.
2. **Error propagation**: Because ASR and MT are processed separately, ASR errors can propagate to the translation stage and degrade the final translation quality.
3. **Low efficiency**: It is difficult for traditional methods to make full use of large-scale text datasets to train MT models.

To solve these problems, the authors propose the joint speech translation and recognition (JSTAR) model, which has the following features:

- **Fast-slow cascaded encoder architecture**: Fast and slow encoders balance low latency against high accuracy. The fast encoder serves the ASR task, and the slow encoder serves the ST task, where broader context is needed.
- **Multi-objective training strategy**: The ASR and ST objectives are optimized simultaneously during training, enabling JSTAR to produce high-quality ASR and ST results in a streaming setting.
- **Multi-channel directional ASR solution**: Suited to bilingual conversation on smart glasses, the model distinguishes speech arriving from different directions, corresponding to the wearer and the conversational partner.

In addition, the authors study different pre-training strategies, including first training a transducer-based (RNN-T) streaming MT model and using it to initialize JSTAR's parameters to further improve performance. With these improvements, JSTAR outperforms a strong cascaded ST model in both BLEU score and latency.
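
The data flow of a fast-slow cascaded encoder with a joint objective can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the authors' implementation: `windowed_mean` is a hypothetical stand-in for a real encoder layer, the context sizes are arbitrary, and combining the two losses as a weighted sum with an `asr_weight` parameter is an assumption, since the summary only says both objectives are optimized simultaneously.

```python
from typing import List, Tuple

Frames = List[float]


def windowed_mean(frames: Frames, left: int, right: int) -> Frames:
    """Toy encoder layer (hypothetical): each output frame averages the
    inputs from `left` frames back to `right` frames ahead (lookahead)."""
    out = []
    for t in range(len(frames)):
        lo = max(0, t - left)
        hi = min(len(frames), t + right + 1)
        window = frames[lo:hi]
        out.append(sum(window) / len(window))
    return out


def fast_encoder(frames: Frames) -> Frames:
    # No lookahead: low latency, intended to feed the ASR branch.
    return windowed_mean(frames, left=2, right=0)


def slow_encoder(fast_out: Frames) -> Frames:
    # Stacked on the fast encoder with extra lookahead: more context,
    # intended to feed the ST branch.
    return windowed_mean(fast_out, left=2, right=3)


def cascaded_encode(frames: Frames) -> Tuple[Frames, Frames]:
    """Return (fast_out, slow_out); the slow encoder consumes fast_out."""
    fast_out = fast_encoder(frames)
    slow_out = slow_encoder(fast_out)
    return fast_out, slow_out


def joint_loss(asr_loss: float, st_loss: float, asr_weight: float = 0.5) -> float:
    """Assumed multi-objective combination: a convex weighted sum."""
    return asr_weight * asr_loss + (1.0 - asr_weight) * st_loss


if __name__ == "__main__":
    fast_out, slow_out = cascaded_encode([1.0, 2.0, 3.0, 4.0])
    print(fast_out, slow_out, joint_loss(2.0, 4.0))
```

The point of the sketch is the wiring: the ASR branch can emit from `fast_out` without waiting for future frames, while the ST branch reads `slow_out`, which depends on a few future frames and therefore trades some latency for context.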

Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition
