Abstract: The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to VAs without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, limited scalability, and induced human biases. In this work, we propose a neural network based audio-gesture multimodal fusion system that (1) better understands the temporal correlation between audio and gesture data, leading to precise invocations; (2) generalizes to a wide range of environments and scenarios; (3) is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times; and (4) improves productivity in asset development processes.

What problem does this paper attempt to address?

This paper aims to improve the accuracy and efficiency of invoking Voice Assistants (VAs) on smartwatches without a trigger word. Specifically, it targets the Raise To Speak (RTS) feature and improves it through multimodal fusion of gesture and audio data. Existing RTS systems rely mainly on heuristics and hand-engineered Finite State Machines (FSMs) to fuse gesture and audio data for decision-making, and these methods suffer from limited adaptability, limited scalability, and human-induced biases. The paper therefore proposes a neural-network-based multimodal fusion system that aims at:

1. **Better understanding the temporal correlation between audio and gesture data**, so as to achieve more precise invocations.
2. **Generalizing to a wide range of environments and scenarios**, improving the adaptability and robustness of the system.
3. **Being lightweight and suitable for deployment on low-power devices (such as smartwatches)**, with a fast launch time.
4. **Improving productivity in the asset development process**, reducing the complexity of system design.

To achieve these goals, the paper proposes a lightweight Gated Recurrent Unit (GRU) network named Neural Policy, which integrates voice and gesture signals and makes a binary decision on whether to trigger RTS. The method adopts a late-fusion strategy, processing the voice and gesture modalities separately before fusing them, in order to cope with the resource constraints of low-power devices. According to the paper, experimental results show that the method significantly reduces the False Rejection Rate (FRR) and False Acceptance Rate (FAR) compared with existing methods, while keeping computational cost low and response time fast.
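The late-fusion idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the class names `GRUCell` and `NeuralPolicySketch`, the feature dimensions, the hidden size, the random untrained weights, and the 0.5 threshold are all assumptions made for illustration. It assumes that separate upstream audio and gesture models (not shown) each emit a per-frame feature vector, which is concatenated and fed to a small GRU whose final state yields a binary trigger decision. The helper `far_frr` computes FAR and FRR as simple rates over labelled examples, which is also an assumption; the paper may define these metrics differently.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class GRUCell:
    """Plain GRU cell; biases are omitted to keep the sketch short."""

    def __init__(self, input_size, hidden_size, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # W* act on the input, U* act on the previous hidden state.
        self.Wz, self.Uz = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.Wr, self.Ur = mat(hidden_size, input_size), mat(hidden_size, hidden_size)
        self.Wh, self.Uh = mat(hidden_size, input_size), mat(hidden_size, hidden_size)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]  # update gate
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]  # reset gate
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, rh))]
        # Blend the previous state and the candidate state with the update gate.
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


class NeuralPolicySketch:
    """Hypothetical late-fusion policy over per-frame audio and gesture features."""

    def __init__(self, audio_dim, gesture_dim, hidden_size=8, seed=0):
        rng = random.Random(seed)  # untrained, random weights (illustration only)
        self.cell = GRUCell(audio_dim + gesture_dim, hidden_size, rng)
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
        self.b_out = 0.0

    def trigger_probability(self, audio_seq, gesture_seq):
        h = [0.0] * self.cell.hidden_size
        for a, g in zip(audio_seq, gesture_seq):
            # Late fusion: each modality was processed upstream; fuse by concatenation.
            h = self.cell.step(list(a) + list(g), h)
        logit = sum(w * x for w, x in zip(self.w_out, h)) + self.b_out
        return sigmoid(logit)

    def should_trigger(self, audio_seq, gesture_seq, threshold=0.5):
        return self.trigger_probability(audio_seq, gesture_seq) >= threshold


def far_frr(decisions, labels):
    """Return (FAR, FRR) as rates; a rate is 0.0 if its class has no examples."""
    pairs = list(zip(decisions, labels))
    negatives = sum(1 for _, y in pairs if not y)
    positives = sum(1 for _, y in pairs if y)
    false_accepts = sum(1 for d, y in pairs if d and not y)
    false_rejects = sum(1 for d, y in pairs if not d and y)
    far = false_accepts / negatives if negatives else 0.0
    frr = false_rejects / positives if positives else 0.0
    return far, frr
```

In this sketch, `NeuralPolicySketch(audio_dim=2, gesture_dim=3).should_trigger(audio_frames, gesture_frames)` would return a boolean for two equal-length sequences of per-frame feature vectors. A deployed system would use trained weights and a threshold tuned to balance FAR against FRR.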

Efficient Multimodal Neural Networks for Trigger-less Voice Assistants

A New Mmwave-Speech Multimodal Speech System for Voice User Interface

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

Improving Voice Trigger Detection with Metric Learning

Wavoice: an Mmwave-Assisted Noise-Resistant Speech Recognition System

Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

Leveraging Topics and Audio Features with Multimodal Attention for Audio Visual Scene-Aware Dialog

Multipurpose Virtual Assistant Using Machine Learning

Design and implementation of smart voice assistant and recognizing academic words

Exploring Interactive Gestures with Voice Assistant on HMDs in Social Situations

Nonverbal Sound Detection for Disordered Speech

Voice Assistant for Blind Person

VOICE BASED VIRTUAL ASSISTANT

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

Continuous Authentication for Voice Assistants

Enabling Voice-Accompanying Hand-to-Face Gesture Recognition with Cross-Device Sensing

Gesture Controlled Virtual Mouse with Voice Assistant

Multichannel Voice Trigger Detection Based on Transform-average-concatenate