Abstract: We propose self-training with a noisy student-teacher approach for streaming keyword spotting that can utilize large-scale unlabeled data and aggressive data augmentation. The proposed method applies aggressive data augmentation (spectral augmentation) to the inputs of both student and teacher and utilizes unlabeled data at scale, which significantly boosts the accuracy of the student under challenging conditions. Such aggressive augmentation usually degrades model performance when used in supervised training with hard-labeled data. Experiments show that aggressive spec augmentation applied to the baseline supervised training method degrades accuracy, while the proposed self-training with noisy student-teacher training improves accuracy on some test sets with difficult conditions by as much as 60%.

What problem does this paper attempt to address?

This paper aims to improve the robustness and accuracy of keyword spotting (KWS) models under challenging conditions (such as accented speech, noise, and far-field environments). Specifically, the authors propose a new self-training method: a noisy student-teacher framework that combines large-scale unlabeled data with aggressive data augmentation (such as spectral augmentation) to improve KWS performance.

### Main problems

1. **Requirement for high-quality labeled data**: Traditional supervised learning methods require a large amount of high-quality labeled data, which is both costly and difficult to obtain.
2. **Negative impact of aggressive data augmentation**: In supervised learning, aggressive data augmentation (such as spectral augmentation) may degrade model performance. Heavy augmentation may turn a positive sample into what is effectively a negative one while its hard label stays positive, thereby increasing the false accept rate.
3. **Utilization of unlabeled data**: How to use large-scale unlabeled data effectively to improve model performance is a key issue.

### Solutions

The authors propose a two-stage self-training method:

1. **First stage**: Train a teacher model (the baseline model) with traditional supervised learning, using labeled data and classic data augmentation methods.
2. **Second stage**: Train the student model on soft labels generated by the teacher model. In this stage, aggressive data augmentation (such as spectral augmentation) is applied to the inputs of both the teacher and the student, and unlabeled data is used for training.
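The second stage can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a spectrogram stored as a list of frames, a simplified masking routine standing in for spectral augmentation, and that the same augmented copy is fed to both models, which is one reading of the description above. The names `spec_augment`, `make_student_example`, and `toy_teacher` are hypothetical.

```python
import random


def spec_augment(spectrogram, max_time_mask=2, max_freq_mask=2, rng=random):
    """Zero out one random band of time frames and one band of frequency bins.

    `spectrogram` is a list of frames, each a list of per-frequency values.
    A masked copy is returned; the input is left unchanged.
    This is a simplified stand-in for spectral augmentation (assumption).
    """
    num_frames = len(spectrogram)
    num_bins = len(spectrogram[0])
    masked = [list(frame) for frame in spectrogram]

    # Time mask: blank a contiguous run of frames.
    t_width = rng.randint(0, min(max_time_mask, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        masked[t] = [0.0] * num_bins

    # Frequency mask: blank a contiguous run of bins in every frame.
    f_width = rng.randint(0, min(max_freq_mask, num_bins))
    f_start = rng.randint(0, num_bins - f_width)
    for frame in masked:
        for f in range(f_start, f_start + f_width):
            frame[f] = 0.0
    return masked


def make_student_example(teacher_fn, spectrogram, rng=random):
    """Build one (input, soft label) training pair for the student.

    The teacher scores the augmented input, so its soft label can drop
    when masking removes the keyword evidence. A fixed hard label could not.
    """
    augmented = spec_augment(spectrogram, rng=rng)
    soft_label = teacher_fn(augmented)
    return augmented, soft_label


def toy_teacher(spectrogram):
    """Hypothetical teacher: the fraction of cells that are still non-zero."""
    cells = [value for frame in spectrogram for value in frame]
    return sum(1.0 for value in cells if value != 0.0) / len(cells)
```

With `toy_teacher`, the soft label shrinks as more of the input is masked, which mirrors the motivation for letting the teacher see the augmented input rather than reusing a fixed hard label.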
### Key innovation points

- **Aggressive data augmentation**: Because aggressive augmentation is applied to the inputs of both the teacher and the student, the teacher can adjust its soft labels to the changes in the input, avoiding the inaccuracy of hard labels under aggressive augmentation.
- **Utilization of large-scale unlabeled data**: The self-training framework makes full use of unlabeled data, further improving the model's generalization ability.
- **Improved self-training method**: Unlike traditional self-training methods, this method applies aggressive augmentation to both the teacher and the student in the same round. This is especially suitable for binary classification tasks such as KWS, because the space of positive samples in such tasks is relatively small.

### Experimental results

The experimental results show that the proposed noisy student-teacher self-training method significantly improves model accuracy on multiple challenging test sets, especially in far-field, accented, and noisy conditions. Accuracy on some test sets with difficult conditions improves by as much as 60%.

### Formula summary

The loss function used in the paper is:

\[ \text{Student-Teacher Loss} = \alpha \times \text{Loss}_E + \text{Loss}_D \]

where:

- \(\text{Loss}_D\) is the cross-entropy loss on the decoder output: \[ \text{Loss}_D = \text{cross entropy}(y_T^d, y_S^d) \]
- \(\text{Loss}_E\) is the cross-entropy loss on the encoder output: \[ \text{Loss}_E = \text{cross entropy}(y_T^e, y_S^e) \]
- \(y_T = [y_T^d, y_T^e] = f_T(\text{augment}(x))\) and \(y_S = [y_S^d, y_S^e] = f_S(\text{augment}(x))\) are the outputs of the teacher and student models, respectively, on the augmented inputs.
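The loss above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's code. It assumes the encoder and decoder outputs are already probability distributions over classes, and the default `alpha=1.0` is a placeholder because the summary does not state a value. The function names are hypothetical.

```python
import math


def cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Cross entropy H(p_T, p_S) = -sum_k p_T[k] * log(p_S[k]).

    The teacher distribution acts as the soft target. `eps` guards
    against log(0) and is an implementation detail of this sketch.
    """
    return -sum(
        p_t * math.log(max(p_s, eps))
        for p_t, p_s in zip(teacher_probs, student_probs)
    )


def student_teacher_loss(y_t_enc, y_s_enc, y_t_dec, y_s_dec, alpha=1.0):
    """Compute alpha * Loss_E + Loss_D from teacher and student outputs.

    alpha=1.0 is an assumed placeholder, not a value from the paper.
    """
    loss_e = cross_entropy(y_t_enc, y_s_enc)  # encoder term
    loss_d = cross_entropy(y_t_dec, y_s_dec)  # decoder term
    return alpha * loss_e + loss_d
```

A student whose outputs match the teacher's soft labels gets a lower loss than one that contradicts them; for example, `student_teacher_loss([0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1])` is smaller than the same call with the student outputs flipped to `[0.1, 0.9]`.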
Together, these changes address the key challenges in keyword spotting, in particular robustness in complex acoustic environments.

Noisy student-teacher training for robust keyword spotting

Learning with Noisy Labels Via Self-supervised Adversarial Noisy Masking

Meta-Self-Training Based on Teacher–Student Network for Industrial Label-Noise Fault Diagnosis

Noise-Robust Keyword Spotting through Self-supervised Pretraining

Understanding temporally weakly supervised training: A case study for keyword spotting

Self-training with Noisy Student improves ImageNet classification

On-the-fly Denoising for Data Augmentation in Natural Language Understanding

Noise-BERT: A Unified Perturbation-Robust Framework with Noise Alignment Pre-training for Noisy Slot Filling Task

Meta Self-Refinement for Robust Learning with Weak Supervision

Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble

Towards Noise-resistant Object Detection with Noisy Annotations

TeachAugment: Data Augmentation Optimization Using Teacher Knowledge

Self-Train Before You Transcribe

Dynamic training for handling textual label noise

Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels

Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Spot keywords from very noisy and mixed speech

Leveraging Just a Few Keywords for Fine-Grained Aspect Detection Through Weakly Supervised Co-Training

Student-Teacher Learning from Clean Inputs to Noisy Inputs

Adaptive Self-training for Few-shot Neural Sequence Labeling

Enhancing Robustness in Learning with Noisy Labels: an Asymmetric Co-Training Approach