Abstract: Effective spoken dialog systems should support natural interactions with quick, rhythmic timing that mirrors human communication. To reduce response times, previous work has focused on minimizing the latency of automatic speech recognition (ASR). However, this approach still requires waiting until the speaker has finished speaking before ASR can complete, which limits the time available for natural language processing (NLP) to formulate an accurate response. Humans, by contrast, continuously anticipate and prepare responses while the other party is still speaking, which lets us respond appropriately without missing the optimal moment to speak. In this work, as a pioneering study toward a conversational system that simulates this anticipatory behavior, we aim to realize a function that, given the middle portion of an utterance, predicts the forthcoming words and estimates the time remaining until the end of the utterance (EOU). To achieve this, we propose a training strategy for an encoder-decoder-based ASR system that masks future segments of an utterance and prompts the decoder to predict the words in the masked audio. Additionally, we develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information to accurately detect the EOU. Experimental results demonstrate that the proposed model can predict upcoming words and estimate future EOU events up to 300 ms before the actual EOU. Moreover, the proposed training strategy yields general improvements in ASR performance.
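The masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_future_frames`, the 10 ms frame shift, and the choice to overwrite masked frames with a constant value are all assumptions made for illustration. The abstract states only that future segments of the utterance are masked while the decoder is still trained to predict the words they contain.

```python
from typing import List


def mask_future_frames(
    frames: List[List[float]],
    mask_ms: int,
    frame_shift_ms: int = 10,   # assumed frame shift, not given in the abstract
    mask_value: float = 0.0,    # assumed fill value, not given in the abstract
) -> List[List[float]]:
    """Return a copy of `frames` with the final `mask_ms` of audio masked.

    The training target (the full transcript) is left unchanged by the
    caller, so a decoder trained on the result has to predict words whose
    audio it cannot observe.
    """
    if mask_ms < 0 or frame_shift_ms <= 0:
        raise ValueError("mask_ms must be >= 0 and frame_shift_ms must be > 0")

    # Number of trailing frames to hide, capped at the utterance length.
    n_mask = min(len(frames), mask_ms // frame_shift_ms)
    keep = len(frames) - n_mask

    visible = [list(f) for f in frames[:keep]]
    masked = [[mask_value] * len(f) for f in frames[keep:]]
    return visible + masked


# Hypothetical usage: a 500 ms utterance of 2-dimensional features,
# with the last 300 ms hidden from the encoder.
features = [[1.0, 2.0] for _ in range(50)]
masked_features = mask_future_frames(features, mask_ms=300)
```

In this sketch the sequence length is preserved, so the masked region still occupies time steps in the encoder input. Truncating the audio instead would be an equally plausible reading of the abstract.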

Dynamic Speech Endpoint Detection with Regression Targets

Design and Implementation of End-Point Detection Accelerator for Speech Recognition

Two-pass Endpoint Detection for Speech Recognition

Precise Detection of Speech Endpoints Dynamically: A Wavelet Convolution based approach

Real-time Caller Intent Detection In Human-Human Customer Support Spoken Conversations

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Device-directed Utterance Detection

Endpoint Detect Method of Embedded Speech Recognition System

A Robust Algorithm For Real-Time Endpoint Detection In The Noisy Mobile Environments

Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Robust Dual-Modal Speech Keyword Spotting for XR Headsets

Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems

Effective Speech Endpoint Detection Algorithm For Voiceprint Recognition

Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

Endophasia

End-Point Detection with State Transition Model based on Chunk-Wise Classification

Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

Target Active Speaker Detection with Audio-visual Cues