Abstract: Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, there are still several critical challenges to address before this model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model size should be small enough to fit a limited latency and CPU/Memory budget. To meet the multi-faceted requirements, we propose a series of novel designs: 1) advanced speaker embedding modulation methods; 2) a new training paradigm to generalize to enrollment-less conditions; 3) architecture and runtime optimizations for latency and resource restrictions. Extensive experiments on a realistic speech recognition system demonstrated the state-of-the-art performance of our proposed method.

What problem does this paper attempt to address?

The main problem this paper addresses is how to optimize the Personal Voice Activity Detection (Personal VAD) model so that it works effectively inside an on-device speech recognition system. Specifically, the paper focuses on the following key challenges:

1. **Performance Requirements**: The model must perform well in both enrollment and enrollment-less scenarios. With enrollment, it should minimize insertion errors caused by non-target speakers while avoiding deletion of the target speaker's speech; without enrollment, it should perform at least as well as a standard VAD.
2. **Stream Processing**: The model must operate in a streaming fashion, that is, process the input speech in real time.
3. **Model Size**: The model must be small enough to fit the limited latency and CPU/memory budget of the device.

To address these challenges, the paper proposes the following designs:

1. **Advanced Speaker Embedding Modulation Methods**:
   - **FiLM Layer**: Scale and shift the input features through a Feature-wise Linear Modulation (FiLM) layer to better fuse acoustic features and speaker embeddings.
   - **Speaker Pre-network**: Extract speaker information from the acoustic features with a pre-network, compute its cosine similarity with the target speaker embedding, and use that similarity to condition the model.
2. **New Training Paradigm**:
   - During training, randomly replace some of the target speaker embeddings with zero vectors and change the non-target speaker labels to target speaker labels, so that the model behaves like a standard VAD in the enrollment-less condition.
3. **Architecture and Runtime Optimization**:
   - **Conformer Backbone**: Use the Conformer architecture instead of the traditional Bidirectional LSTM (BLSTM) to improve the accuracy and streaming capability of the model.
   - **Model Quantization**: Quantize the model weights to 8-bit integers to reduce the model size and improve inference speed.

With these improvements, the proposed method achieves state-of-the-art performance on on-device speech recognition tasks: it significantly reduces ASR insertion errors in the enrollment condition while maintaining performance comparable to a standard VAD in the enrollment-less condition.
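
The FiLM modulation, the cosine-similarity conditioning, and the enrollment-less training step described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the label ids, the linear projections used to predict the scale and shift, and the `drop_prob` parameter are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical label ids; the summary does not specify an encoding.
NON_SPEECH, TARGET_SPEECH, NON_TARGET_SPEECH = 0, 1, 2


def film(features, embedding, w_scale, b_scale, w_shift, b_shift):
    """Scale and shift every frame with values predicted from the embedding.

    features: (frames, feature_dim), embedding: (embed_dim,)
    w_scale, w_shift: (embed_dim, feature_dim); b_scale, b_shift: (feature_dim,)
    """
    gamma = embedding @ w_scale + b_scale  # per-feature scale
    beta = embedding @ w_shift + b_shift   # per-feature shift
    return features * gamma + beta         # broadcast over frames


def cosine_similarity(frame_embeddings, target_embedding, eps=1e-8):
    """Per-frame cosine similarity between pre-network outputs and the target."""
    num = frame_embeddings @ target_embedding
    den = np.linalg.norm(frame_embeddings, axis=-1) * np.linalg.norm(target_embedding)
    return num / (den + eps)


def make_enrollment_less(embedding, labels, rng, drop_prob=0.5):
    """With probability drop_prob, simulate a missing enrollment.

    The embedding becomes a zero vector and non-target speech is relabeled
    as target speech, so the training target matches a standard VAD.
    """
    if rng.random() < drop_prob:
        embedding = np.zeros_like(embedding)
        labels = np.where(labels == NON_TARGET_SPEECH, TARGET_SPEECH, labels)
    return embedding, labels
```

In this sketch, a zero embedding makes the FiLM scale and shift collapse to the bias terms `b_scale` and `b_shift`, so the modulation no longer depends on any speaker; pairing that input with relabeled targets is one way to read how the training paradigm pushes the model toward standard VAD behavior when no enrollment is available.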

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
