Improving Device Directedness Classification of Utterances with Semantic Lexical Features

Kellen Gillespie, Ioannis C. Konstantakopoulos, Xingzhi Guo, Vishal Thanvantri Vasudevan, Abhinav Sethy

DOI: https://doi.org/10.1109/ICASSP40776.2020.9054304

2020-09-30

Abstract: User interactions with personal assistants like Alexa, Google Home and Siri are typically initiated by a wake term or wakeword. Several personal assistants feature "follow-up" modes that allow users to make additional interactions without the need of a wakeword. For the system to only respond when appropriate, and to ignore speech not intended for it, utterances must be classified as device-directed or non-device-directed. State-of-the-art systems have largely used acoustic features for this task, while others have used only lexical features or have added LM-based lexical features. We propose a directedness classifier that combines semantic lexical features with a lightweight acoustic feature and show it is effective in classifying directedness. The mixed-domain lexical and acoustic feature model is able to achieve 14% relative reduction of EER over a state-of-the-art acoustic-only baseline model. Finally, we successfully apply transfer learning and semi-supervised learning to the model to improve accuracy even further.

Audio and Speech Processing, Computation and Language, Machine Learning, Sound

What problem does this paper attempt to address?

This paper addresses how to classify more accurately whether an utterance spoken near a personal assistant (such as Alexa, Google Home, or Siri) is directed at the device. The problem arises in "follow-up" (continuous conversation) modes, where users can issue further commands without repeating the wakeword after initially waking the device. In that setting the system must distinguish device-directed utterances from non-device-directed ones. This matters for the user experience: if the system responds to speech that was not intended for it, users may be confused or misunderstood. To this end, the authors propose a classifier that combines semantic lexical features with a lightweight acoustic feature to improve directedness classification. The paper also explores transfer learning and semi-supervised learning as ways to improve the model further, reducing the data annotation workload and improving generalization. Experimental results show that, compared with an acoustic-only baseline model, the proposed model achieves a 14% relative reduction in Equal Error Rate (EER), and semi-supervised learning yields a further 5% relative improvement.
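To make the reported metric concrete, the following is a minimal sketch of how an Equal Error Rate and a relative reduction can be computed. It is not the authors' evaluation code: the function names, the threshold sweep, and the example EER values of 10.0% and 8.6% are illustrative assumptions chosen only so that the arithmetic comes out to 14%.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: classifier outputs, higher means "more likely device-directed".
    labels: 1 for device-directed, 0 for non-device-directed.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # An utterance is accepted as device-directed when score >= threshold.
        false_accept = sum(s >= threshold for s in negatives) / len(negatives)
        false_reject = sum(s < threshold for s in positives) / len(positives)
        gap = abs(false_accept - false_reject)
        # The EER is the error rate where false accepts and false rejects
        # are (as close as possible to) equal.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (false_accept + false_reject) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Hypothetical toy data, not taken from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)

# Hypothetical EER values: a drop from 10.0% to 8.6% is a 14% relative reduction.
example_reduction = relative_reduction(0.100, 0.086)
```

Note that a relative reduction is measured against the baseline error rate, so a 14% relative reduction is not a drop of 14 percentage points.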

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

Real-time Caller Intent Detection In Human-Human Customer Support Spoken Conversations

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

Nonverbal Sound Detection for Disordered Speech

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System

STEER: Semantic Turn Extension-Expansion Recognition for Voice Assistants

Speech Enhancement for Wake-Up-Word detection in Voice Assistants

Robust Dual-Modal Speech Keyword Spotting for XR Headsets

Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles

On‐device Audio‐visual Multi‐person Wake Word Spotting

M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection

Enhancing Virtual Assistant Intelligence: Precise Area Targeting for Instance-level User Intents beyond Metadata

Efficient Self-Attention Model for Speech Recognition-Based Assistive Robots Control

Personalized Speech Recognizer With Keyword-Based Personalized Lexicon And Language Model Using Word Vector Representations

Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

Building competitive direct acoustics-to-word models for English conversational speech recognition