Boosting keyword spotting through on-device learnable user speech characteristics

Cristian Cioflan, Lukas Cavigelli, Luca Benini
2024-03-13
Abstract: Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding learning the user's speech characteristics. The so-generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19% from 30.1% to 24.3% based on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7 kparameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how to improve the accuracy of keyword spotting (KWS) systems through on-device learning under TinyML (tiny machine learning) constraints. Specifically, when a KWS system is deployed in inference conditions that were not seen during offline training, it must be tuned on-site to the speech characteristics of the target user so that classifier accuracy improves.

### Problem Background

Keyword spotting systems are widely used in smart speakers, smartphones, and other Internet of Things devices, enabling users to interact with devices through voice commands. In real deployments, these systems face the following challenges:

1. **Environmental Changes**: Background noise, reverberation, and echo degrade system performance.
2. **Differences in User Voices**: Speaker-specific characteristics (such as pitch and intonation) and speech disorders (such as stuttering) reduce accuracy.
3. **Data Scarcity**: In real-world scenarios, it is difficult to collect a large number of labeled in-domain samples for on-device learning.
4. **Resource Limitations**: Battery-powered edge devices have limited compute and memory, so they cannot support expensive schemes that update the whole backbone.

### Solutions Proposed in the Paper

To address these problems, the paper proposes a new on-device learning architecture that consists of a pretrained backbone network and a user-aware embedding layer. The main elements are:

- **Lightweight Backbone Network**: A pretrained depthwise separable convolutional neural network (DS-CNN), which is suitable for low-resource devices.
- **User-Aware Embedding Layer**: An embedding layer learns the speech characteristics of the target user. These user features are fused with the features extracted by the backbone, and the fused representation is used for classification.
- **Few-Shot Learning**: Only the user projections are updated on-device, so the architecture can adapt with a small number of samples, including when few classes are available.
- **Low Power Consumption and Low Memory Requirements**: On-device training requires 23.7k parameters and 1 MFLOP per epoch, which suits ultra-low-power, memory-constrained platforms such as battery-powered microcontrollers.

### Experimental Results

On the 35-class problem of the Google Speech Commands dataset, the method achieves a relative error rate reduction of up to 19% (from 30.1% to 24.3%) for domain shifts caused by unseen speakers. The method also demonstrates few-shot learning capability in sample-scarce and class-scarce conditions.

### Summary

The proposed method improves the accuracy of keyword spotting systems for previously unseen users while remaining feasible on resource-constrained edge devices, which supports more personalized voice interaction.
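
### Illustrative Sketch

The core idea summarized above (a frozen pretrained model plus a small learnable user embedding that is fused with the backbone features) can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the authors' implementation: the fusion is assumed to be element-wise addition, the DS-CNN backbone and classifier are replaced by a fixed random linear layer, the dimensions are arbitrary, the data are synthetic, and all names (`W_CLS`, `forward`, `train_user_embedding`, `user_offset`) are hypothetical.

```python
import math
import random

random.seed(0)

FEAT_DIM = 8      # hypothetical backbone feature size
NUM_CLASSES = 3   # toy value; the paper's task has 35 classes

# Frozen stand-in for the pretrained classifier weights (never updated below).
W_CLS = [[random.gauss(0.0, 0.5) for _ in range(FEAT_DIM)]
         for _ in range(NUM_CLASSES)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, user_emb):
    """Fuse backbone features with the user embedding, then classify."""
    fused = [f + u for f, u in zip(features, user_emb)]  # assumed fusion: addition
    logits = [sum(w * x for w, x in zip(row, fused)) for row in W_CLS]
    return softmax(logits)


def mean_loss(samples, user_emb):
    """Average cross-entropy over (features, label) pairs."""
    return -sum(math.log(forward(x, user_emb)[y]) for x, y in samples) / len(samples)


def train_user_embedding(samples, epochs=100, lr=0.1):
    """Full-batch gradient descent on the user embedding only."""
    user_emb = [0.0] * FEAT_DIM
    for _ in range(epochs):
        grad = [0.0] * FEAT_DIM
        for x, y in samples:
            probs = forward(x, user_emb)
            # d(cross-entropy)/d(logits) = probs - one_hot(label)
            dlogits = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]
            for j in range(FEAT_DIM):
                grad[j] += sum(dlogits[k] * W_CLS[k][j] for k in range(NUM_CLASSES))
        user_emb = [u - lr * g / len(samples) for u, g in zip(user_emb, grad)]
    return user_emb


# Toy "unseen speaker": each feature vector is a class prototype plus a fixed offset.
user_offset = [random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]
samples = []
for label in range(NUM_CLASSES):
    for _ in range(2):  # two shots per class
        proto = [2.0 * w for w in W_CLS[label]]
        x = [p + o + random.gauss(0.0, 0.1) for p, o in zip(proto, user_offset)]
        samples.append((x, label))

loss_before = mean_loss(samples, [0.0] * FEAT_DIM)
user_emb = train_user_embedding(samples)
loss_after = mean_loss(samples, user_emb)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In this sketch only the `FEAT_DIM` values of `user_emb` are trained while `W_CLS` stays fixed, which mirrors in miniature why updating user projections is cheaper than updating a backbone. The parameter and FLOP counts of the toy example bear no relation to the figures reported in the paper.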