Abstract: Keyword Spotting (KWS) is a critical aspect of audio-based applications on mobile devices and virtual assistants. Recent developments in Federated Learning (FL) have significantly expanded the ability to train machine learning models by utilizing the computational and private data resources of numerous distributed devices. However, existing FL methods typically require that devices possess accurate ground-truth labels, which can be both expensive and impractical when dealing with local audio data. In this study, we first demonstrate the effectiveness of Semi-Supervised Learning (SSL) and FL for KWS. We then extend our investigation to Semi-Supervised Federated Learning (SSFL) for KWS, where devices possess completely unlabeled data, while the server has access to a small amount of labeled data. We perform numerical analyses using state-of-the-art SSL, FL, and SSFL techniques to demonstrate that the performance of KWS models can be significantly improved by leveraging the abundant unlabeled heterogeneous data available on devices.

What problem does this paper attempt to address?

The paper aims to address several key issues in the task of Keyword Spotting (KWS), especially the challenges faced when applied on mobile devices and virtual assistants. Specifically, the paper focuses on the following aspects:

1. **Utilizing Unlabeled Data**: Existing Federated Learning (FL) methods typically require precise labeled data on the device side, which is both expensive and impractical for local audio data. Therefore, the researchers propose a Semi-Supervised Federated Learning (SSFL) framework that can fully utilize the large amount of unlabeled data on the device side with only a small amount of labeled data on the server side.
2. **Addressing the Non-Independent and Identically Distributed (Non-IID) Problem**: In federated learning, the data distribution across different clients may vary significantly. This paper effectively mitigates this issue and improves model performance through alternate training techniques, combining Semi-Supervised Learning (SSL) and federated learning.
3. **Application of Data Augmentation Techniques**: To better utilize unlabeled data, the researchers explored various data augmentation methods, including basic augmentation, SpecAugment, RandAugment, and MixAugment, to further enhance the performance of the KWS model.
4. **Transfer of Pre-trained Models**: When a large amount of labeled data is available, SSFL can adapt to new data domains by fine-tuning pre-trained models to improve performance. Experimental results show that starting training or transfer learning from pre-trained models can significantly enhance the performance of KWS models with a small amount of labeled data.

In summary, this paper aims to improve the overall performance of the keyword spotting task by effectively utilizing the rich unlabeled data resources on the device side through a semi-supervised federated learning approach, and proposes effective solutions for non-independent and identically distributed data.
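To make the SSFL setting described above more concrete, here is a minimal Python sketch of two building blocks that such a pipeline commonly relies on: confidence-thresholded pseudo-labeling of unlabeled client audio, and sample-weighted averaging of client parameters on the server (FedAvg-style). This is an illustration under stated assumptions, not the paper's implementation: the function names `pseudo_label` and `fedavg`, the 0.95 threshold, and the flat parameter dictionaries are all hypothetical, and the paper's actual procedure may differ.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def pseudo_label(probs: Sequence[float], threshold: float = 0.95) -> Optional[int]:
    """Return the predicted class if the model is confident enough, else None.

    `probs` is the class-probability vector for one unlabeled clip.
    The 0.95 threshold is an illustrative assumption, not a value from the paper.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


def fedavg(client_params: Sequence[Params], client_sizes: Sequence[int]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    if total == 0:
        raise ValueError("no client contributed any samples")
    averaged: Params = {}
    for name in client_params[0]:
        dim = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][j] * n for p, n in zip(client_params, client_sizes)) / total
            for j in range(dim)
        ]
    return averaged


# Usage: one confident and one uncertain prediction on unlabeled clips.
confident = pseudo_label([0.01, 0.97, 0.02])   # class 1 passes the threshold
uncertain = pseudo_label([0.40, 0.35, 0.25])   # None: the clip is skipped

# Two clients with 1 and 3 pseudo-labeled samples respectively.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
global_params = fedavg(clients, [1, 3])        # {"w": [2.5, 3.5]}
```

In this sketch, each client would train only on clips whose pseudo-label is not `None`, and the number of such clips serves as that client's weight during aggregation, so clients with few confident predictions influence the global model less.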

Semi-Supervised Federated Learning for Keyword Spotting

Exploring Federated Self-Supervised Learning for General Purpose Audio Understanding

Avoid Overfitting User Specific Information in Federated Keyword Spotting

Improving Small Footprint Few-shot Keyword Spotting with Supervision on Auxiliary Data

Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors

Speech Augmentation Based Unsupervised Learning for Keyword Spotting

Federated Learning for Audio Semantic Communication

Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology

Improving Label-Deficient Keyword Spotting Through Self-Supervised Pretraining

Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics

Audio-visual Keyword Spotting for Mandarin Based on Discriminative Local Spatial-Temporal Descriptors

Exploring Representation Learning for Small-Footprint Keyword Spotting

Noise-Robust Keyword Spotting through Self-supervised Pretraining

Bridging the Gap Between Audio and Text Using Parallel-Attention for User-Defined Keyword Spotting

CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework Based on Cascaded Transducer-Transformer

Conditional Online Learning for Keyword Spotting

Federated Learning With Highly Imbalanced Audio Data

Zero-Shot Federated Learning with New Classes for Audio Classification

(FL)$^2$: Overcoming Few Labels in Federated Semi-Supervised Learning

A Hybrid Self-Supervised Learning Framework for Vertical Federated Learning