Abstract: Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding learning the user's speech characteristics. The so-generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19%, from 30.1% to 24.3%, based on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 kparameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.
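
The "up to 19%" figure in the abstract is easiest to read as a relative reduction of the error rate, since the absolute drop from 30.1% to 24.3% is only 5.8 percentage points. The short check below works through that arithmetic; the function and variable names are hypothetical and chosen for illustration only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# Error rates quoted in the abstract, in percent.
baseline_err = 30.1
adapted_err = 24.3

absolute_drop = baseline_err - adapted_err            # percentage points
relative_drop = relative_reduction(baseline_err, adapted_err)

print(f"absolute drop: {absolute_drop:.1f} percentage points")
print(f"relative drop: {relative_drop:.1%}")
```

Under this reading, the two numbers in the abstract are consistent with each other: the relative drop rounds to 19%.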

What problem does this paper attempt to address?

The paper addresses how to improve the accuracy of keyword spotting (KWS) systems through on-device learning under the constraints of TinyML (Tiny Machine Learning). Specifically, when an offline-trained classifier is deployed in unseen inference conditions, it must be tuned on-site to adapt to the speech characteristics of the target user.

### Problem Background

Keyword spotting systems are widely used in smart speakers, smartphones, and other Internet of Things devices, enabling users to interact with devices through voice commands. In actual deployment, these systems face the following challenges:

1. **Environmental Changes**: Background noise, reverberation, and echo degrade system performance.
2. **Differences in User Voices**: Speaker-specific characteristics (such as pitch and intonation) and speech disorders (such as stuttering) reduce accuracy.
3. **Data Scarcity**: In real-world scenarios, it is difficult to obtain a large number of labeled samples for online learning.
4. **Resource Limitations**: Battery-powered edge devices have limited computing resources and memory, and cannot support expensive backbone update schemes.

### Solutions Proposed in the Paper

To address these problems, the paper proposes a new on-device learning architecture consisting of a pretrained backbone network and a user-aware embedding layer. The main elements are:

- **Lightweight Backbone Network**: A pretrained depthwise separable convolutional neural network (DS-CNN), which is suitable for low-resource devices.
- **User-Aware Embedding Layer**: An embedding layer learns the speech characteristics of the target user; these features are fused with the features extracted by the backbone and used for classification.
- **Few-Shot Learning**: The architecture can adapt on-device with only a small number of samples while still reducing the error rate.
- **Low Compute and Memory Requirements**: The system requires 23.7k parameters and 1 MFLOP per training epoch for on-device training, which makes it feasible for battery-powered microcontrollers.

### Experimental Results

On the 35-class problem of the Google Speech Commands dataset, the method reduces the error rate for unseen speakers by up to 19%, from 30.1% to 24.3%, by updating only the user projections. It also demonstrates few-shot learning capability in sample-scarce and class-scarce conditions.

### Summary

The proposed method improves the accuracy of keyword spotting systems for unseen speakers while remaining feasible on resource-constrained edge devices, which supports more personalized voice interaction.
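
The architecture summarized above (a frozen backbone, a learnable user embedding, and a fusion step before classification) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: additive fusion, the linear softmax head, the learning rate, and all numeric values are assumptions made for illustration, and every name is hypothetical. The point it shows is that only the user embedding is updated on-device, while the classifier weights and backbone features stay fixed.

```python
import math

# Frozen classifier head (3 classes x 3 features). Values are made up.
WEIGHTS = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.5],
    [0.5, -0.5, 1.0],
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, user_emb, weights):
    """Fuse backbone features with the user embedding, then classify.

    Assumption: fusion is element-wise addition; the source only says
    that the features are fused.
    """
    fused = [f + u for f, u in zip(features, user_emb)]
    logits = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return softmax(logits)


def update_user_embedding(features, label, user_emb, weights, lr=0.1):
    """One gradient step on the user embedding only (weights stay frozen).

    Returns the new embedding and the cross-entropy loss before the step.
    """
    probs = classify(features, user_emb, weights)
    loss = -math.log(probs[label])
    # Gradient of cross-entropy w.r.t. the logits: p - one_hot(label).
    err = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Chain rule through the linear head: grad_u = W^T (p - y).
    grad = [
        sum(err[k] * weights[k][d] for k in range(len(weights)))
        for d in range(len(user_emb))
    ]
    new_emb = [u - lr * g for u, g in zip(user_emb, grad)]
    return new_emb, loss


# Stand-in for the frozen backbone's output on one user utterance.
features = [0.2, 0.9, 0.1]
label = 0                      # true keyword class for this utterance
user_emb = [0.0, 0.0, 0.0]     # starts neutral, learned on-device

weights_before = [row[:] for row in WEIGHTS]
losses = []
for _ in range(20):
    user_emb, loss = update_user_embedding(features, label, user_emb, WEIGHTS)
    losses.append(loss)

print(f"loss before adaptation: {losses[0]:.3f}")
print(f"loss after 20 steps:    {losses[-1]:.3f}")
```

Because the gradient flows only into the small embedding vector, the per-step cost in this sketch scales with the size of the classifier head rather than with the backbone, which is the property that makes this kind of update attractive on a microcontroller.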

Boosting keyword spotting through on-device learnable user speech characteristics

Close the Gap Between Deep Learning and Mobile Intelligence by Incorporating Training in the Loop

Explore Training of Deep Convolutional Neural Networks on Battery-powered Mobile Devices: Design and Application

On-Device Domain Learning for Keyword Spotting on Low-Power Extreme Edge Embedded Systems

TinySV: Speaker Verification in TinyML with On-device Learning

Few-Shot Open-Set Learning for On-Device Customization of KeyWord Spotting Systems

Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors

On-device Online Learning and Semantic Management of TinyML Systems

On-Device Training Under 256KB Memory

On-device query intent prediction with lightweight LLMs to support ubiquitous conversations

Automated Customization of On-Device Inference for Quality-of-Experience Enhancement

TinySpeech: Attention Condensers for Deep Speech Recognition Neural Networks on Edge Devices

On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation

TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge

On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech

Dark Experience for Incremental Keyword Spotting

An Investigation Into On-device Personalization of End-to-end Automatic Speech Recognition Models

Training on the Fly: On-device Self-supervised Learning aboard Nano-drones within 20 mW

Conditional Online Learning for Keyword Spotting

Hello Edge: Keyword Spotting on Microcontrollers