Abstract: Effective spoken dialog systems should facilitate natural interactions with quick and rhythmic timing, mirroring human communication patterns. To reduce response times, previous efforts have focused on minimizing the latency of automatic speech recognition (ASR). However, this approach still requires waiting until the speaker has finished speaking before ASR can complete, which limits the time available for natural language processing (NLP) to formulate accurate responses. Humans, by contrast, continuously anticipate and prepare responses while the other party is still speaking, which allows us to respond appropriately without missing the optimal time to speak. In this work, as a pioneering study toward a conversational system that simulates such human anticipatory behavior, we aim to realize a function that predicts the forthcoming words and estimates the time remaining until the end of an utterance (EOU) from the middle portion of the utterance. To achieve this, we propose a training strategy for an encoder-decoder-based ASR system that masks future segments of an utterance and prompts the decoder to predict the words in the masked audio. Additionally, we develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information to accurately detect the EOU. The experimental results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms prior to the actual EOU. Moreover, the proposed training strategy yields general improvements in ASR performance.
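The masking strategy described in the abstract can be illustrated with a minimal sketch of how one training example might be built. This is not the authors' implementation: the function name `make_masked_example`, the 10 ms frame shift, the zero-valued mask, and the uniformly random cutoff are all assumptions made for illustration. The idea shown is only that the trailing frames are hidden while the full transcript is kept as the decoder target, and that the time remaining until the EOU follows from the number of masked frames.

```python
import random


def make_masked_example(frames, transcript, frame_shift_ms=10,
                        mask_value=0.0, rng=None):
    """Build one training example whose trailing audio frames are masked.

    Hypothetical helper: frame_shift_ms, mask_value, and the uniform
    cutoff sampling are illustrative assumptions, not taken from the paper.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to mask a future segment")
    rng = rng or random.Random()
    n = len(frames)

    # Pick a cutoff so that at least one frame stays visible
    # and at least one frame is masked.
    cutoff = rng.randint(1, n - 1)

    # Frames before the cutoff are kept; frames after it are replaced
    # by a constant mask of the same dimensionality.
    visible = [list(f) for f in frames[:cutoff]]
    masked = [[mask_value] * len(f) for f in frames[cutoff:]]

    return {
        "features": visible + masked,
        # The decoder target is still the full transcript, so the model
        # must predict words whose audio it has not seen.
        "target": transcript,
        # Time from the cutoff to the end of the utterance.
        "remaining_ms": (n - cutoff) * frame_shift_ms,
        "cutoff": cutoff,
    }


# Usage: five 2-dimensional frames and a toy transcript.
frames = [[1.0, 2.0] for _ in range(5)]
example = make_masked_example(frames, "hello world", rng=random.Random(0))
```

In this sketch the `remaining_ms` value could serve as a regression target for the time until the EOU; how the paper actually derives that estimate from cross-attention is not reproduced here.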

Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction

End-to-End Joint Target and Non-Target Speakers ASR

Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model

Non-Autoregressive End-To-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

Alignment-Free Training for Transducer-based Multi-Talker ASR

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

Joint streaming model for backchannel prediction and automatic speech recognition

META-CAT: Speaker-Informed Speech Embeddings via Meta Information Concatenation for Multi-talker ASR

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems

Timestamped Embedding-Matching Acoustic-to-Word CTC ASR

Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

Improved Speech Representations with Multi-Target Autoregressive Predictive Coding

Hybrid Autoregressive and Non-Autoregressive Transformer Models for Speech Recognition

SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription

Non-autoregressive End-to-end Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding