Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction

Keita Suzuki,Satoshi Suzuki,Naoki Makishima,Ryo Masumura,Atsushi Ando
DOI: https://doi.org/10.21437/interspeech.2023-564
2023-08-20
Abstract:This paper proposes autoregressive modeling of the joint multi-talker automatic speech recognition (ASR) and timestamp prediction. Autoregressive modeling of multi-talker ASR is a simple and promising approach. However, it does not predict utterance timestamp information despite its being important in practice. To address this problem, our key idea is to extend autoregressive-modeling-based multi-talker ASR to predict quantized timestamp tokens representing the start and end time of an utterance. Our method estimates transcription and utterance-level timestamp tokens of multiple speakers one after another. This enables joint modeling of multi-talker ASR and timestamps prediction without changing the simple autoregressive modeling of the conventional multi-talker ASR. Experimental results show that our method outperforms the ASR performance of conventional autoregressive multi-talker ASR without timestamp prediction and achieves promising times-tamp prediction accuracy.
Computer Science
What problem does this paper attempt to address?