Efficient Multimodal Neural Networks for Trigger-less Voice Assistants

Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Tashweena Heeramun, Karan Sawnhey, Ed Yanosik, Saravana Rathinam, Saurabh Adya
2023-05-20
Abstract: The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to VAs without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, scalability, and induced human biases. In this work, we propose a neural network based audio-gesture multimodal fusion system that (1) Better understands temporal correlation between audio and gesture data, leading to precise invocations (2) Generalizes to a wide range of environments and scenarios (3) Is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times (4) Improves productivity in asset development processes.
Machine Learning, Human-Computer Interaction
What problem does this paper attempt to address?
This paper aims to improve the accuracy and efficiency of invoking Voice Assistants (VAs) on smartwatches without a trigger phrase. Specifically, it targets the Raise To Speak (RTS) feature, using multimodal fusion to combine gesture and audio data. Existing RTS systems rely mainly on heuristics and hand-engineered Finite State Machines (FSMs) to fuse gesture and audio data for decision-making, but these methods suffer from limited adaptability, limited scalability, and human-induced biases. The paper therefore proposes a neural-network-based multimodal fusion system that aims at:

1. **Better understanding the temporal correlation between audio and gesture data**, so as to achieve more precise invocations.
2. **Generalizing to a wide range of environments and scenarios**, improving the adaptability and robustness of the system.
3. **Being lightweight and deployable on low-power devices (such as smartwatches)**, with quick launch times.
4. **Improving productivity in the asset development process**, reducing the complexity of system design.

To achieve these goals, the paper proposes Neural Policy, a lightweight Gated Recurrent Unit (GRU) network that integrates speech and gesture signals and makes a binary decision on whether to trigger RTS. The method adopts a late-fusion strategy, processing the speech and gesture modalities separately before fusing them, to cope with the resource constraints of low-power devices. Experimental results show a significant reduction in False Rejection Rate (FRR) and False Acceptance Rate (FAR) compared with existing methods, while keeping computational cost low and response time fast.
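As a rough illustration of the late-fusion idea described above, the following sketch feeds per-frame outputs from two separate upstream models into a single small GRU cell and thresholds the result into a binary trigger decision. It is not the paper's implementation: the class `TinyGRUPolicy`, the function `should_trigger`, the scalar per-frame scores, the hidden size, the omission of bias terms, and the 0.5 threshold are all assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinyGRUPolicy:
    """Hypothetical late-fusion policy: one bias-free GRU cell plus a sigmoid output."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        n = input_size + hidden_size
        self.hidden_size = hidden_size
        self.W_z = mat(hidden_size, n)  # update gate weights
        self.W_r = mat(hidden_size, n)  # reset gate weights
        self.W_h = mat(hidden_size, n)  # candidate state weights
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]

    def step(self, x, h):
        """One GRU update: gates are computed from the concatenation [x, h]."""
        xh = x + h
        z = [sigmoid(a) for a in matvec(self.W_z, xh)]
        r = [sigmoid(a) for a in matvec(self.W_r, xh)]
        xrh = x + [ri * hi for ri, hi in zip(r, h)]
        h_cand = [math.tanh(a) for a in matvec(self.W_h, xrh)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, h_cand)]

    def trigger_probability(self, speech_scores, gesture_scores):
        """Fuse two per-frame score sequences and return a probability in (0, 1)."""
        h = [0.0] * self.hidden_size
        for s, g in zip(speech_scores, gesture_scores):
            # Late fusion: each modality was already processed by its own model;
            # only the resulting per-frame scores are combined here.
            h = self.step([s, g], h)
        return sigmoid(sum(w * hi for w, hi in zip(self.w_out, h)))


def should_trigger(policy, speech_scores, gesture_scores, threshold=0.5):
    """Binary decision: invoke the assistant if the fused probability is high enough."""
    return policy.trigger_probability(speech_scores, gesture_scores) >= threshold


if __name__ == "__main__":
    policy = TinyGRUPolicy(input_size=2, hidden_size=4, seed=0)
    speech = [0.1, 0.7, 0.9, 0.9]   # assumed per-frame speech scores
    gesture = [0.2, 0.8, 0.9, 0.6]  # assumed per-frame raise-gesture scores
    print(policy.trigger_probability(speech, gesture))
    print(should_trigger(policy, speech, gesture))
```

Because the recurrent state carries information across frames, a model of this shape can in principle learn how the timing of the gesture relates to the onset of speech, which is the role fixed heuristics and FSM rules play in the earlier systems. Keeping the fused input to a few values per frame is one way such a policy could stay small enough for a low-power device.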