DiariST: Streaming Speech Translation with Speaker Diarization

Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka
2024-01-23
Abstract: End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, which were originally developed for multi-talker speech recognition. Due to the absence of evaluation benchmarks in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality while taking SD accuracy into account. Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech. To facilitate the research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
This paper addresses the problem of performing speech translation (ST) and speaker diarization (SD) jointly on conversation recordings. Specifically, it focuses on several under-explored challenges, such as performing speaker diarization without accurate word-level timestamps and handling overlapping speech in a streaming manner. These problems are hard to solve with traditional cascaded systems, which usually must wait for automatic speech recognition (ASR) results before running machine translation (MT); this leads to high latency and makes multi-speaker scenarios difficult to handle. To address these problems, the paper proposes DiariST, described as the first streaming ST and SD solution. DiariST is built on a neural transducer-based streaming ST system and integrates token-level serialized output training (t-SOT) and t-vector, two techniques originally developed for multi-talker speech recognition. In addition, because no evaluation benchmark existed for this task, the authors developed a new evaluation dataset, DiariST-AliMeeting, constructed by translating the Chinese reference transcriptions of the AliMeeting corpus into English. They also proposed two new evaluation metrics, speaker-agnostic BLEU (SAgBLEU) and speaker-attributed BLEU (SAtBLEU), which measure translation quality while taking speaker diarization accuracy into account. The experimental results show that DiariST can handle overlapping speech effectively while maintaining low latency, and that it outperforms the offline baseline systems under multiple conditions.
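To make the difference between a speaker-agnostic and a speaker-attributed score concrete, here is a minimal toy sketch in Python. It is not the paper's metric definition or its released evaluation code. It assumes a simple token-overlap F1 as a stand-in for BLEU, and it assumes that "speaker-attributed" means scoring each speaker's text separately under the best mapping between hypothesis and reference speaker labels. The function names and example data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def overlap_f1(hyp: str, ref: str) -> float:
    """Stand-in for BLEU (assumption): token-overlap F1 between two strings."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def speaker_agnostic_score(hyp: dict, ref: dict) -> float:
    """Ignore speaker labels: pool all text on each side, then score once."""
    hyp_text = " ".join(hyp[s] for s in sorted(hyp))
    ref_text = " ".join(ref[s] for s in sorted(ref))
    return overlap_f1(hyp_text, ref_text)


def speaker_attributed_score(hyp: dict, ref: dict) -> float:
    """Score speaker by speaker under the best label mapping (assumption)."""
    hyp_ids, ref_ids = sorted(hyp), sorted(ref)
    n = max(len(hyp_ids), len(ref_ids))
    # Pad with None so that unmatched speakers are scored against empty text.
    hyp_ids += [None] * (n - len(hyp_ids))
    ref_ids += [None] * (n - len(ref_ids))
    best = 0.0
    for perm in permutations(hyp_ids):
        total = sum(
            overlap_f1(hyp.get(h, ""), ref.get(r, ""))
            for h, r in zip(perm, ref_ids)
        )
        best = max(best, total / n)
    return best


# Hypothetical data: the word "fine" is assigned to the wrong speaker.
reference = {"A": "hello how are you", "B": "fine thank you"}
hypothesis = {"spk1": "hello how are you fine", "spk2": "thank you"}

agnostic = speaker_agnostic_score(hypothesis, reference)
attributed = speaker_attributed_score(hypothesis, reference)
print(f"speaker-agnostic: {agnostic:.3f}, speaker-attributed: {attributed:.3f}")
```

In this toy example every translated word is correct, so the speaker-agnostic score is perfect, while the speaker-attributed score drops because one word is credited to the wrong speaker. The exhaustive search over label permutations is only practical for a handful of speakers.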