Noisy student-teacher training for robust keyword spotting

Hyun-Jin Park, Pai Zhu, Ignacio Lopez Moreno, Niranjan Subrahmanya
DOI: https://doi.org/10.48550/arXiv.2106.01604
2021-06-03
Abstract: We propose self-training with noisy student-teacher approach for streaming keyword spotting, that can utilize large-scale unlabeled data and aggressive data augmentation. The proposed method applies aggressive data augmentation (spectral augmentation) on the input of both student and teacher and utilize unlabeled data at scale, which significantly boosts the accuracy of student against challenging conditions. Such aggressive augmentation usually degrades model performance when used with supervised training with hard-labeled data. Experiments show that aggressive spec augmentation on baseline supervised training method degrades accuracy, while the proposed self-training with noisy student-teacher training improves accuracy of some difficult-conditioned test sets by as much as 60%.
Machine Learning
What problem does this paper attempt to address?
This paper aims to improve the robustness and accuracy of keyword spotting (KWS) models under challenging conditions such as accented speech, background noise, and far-field audio. The authors propose a self-training method that combines a noisy student-teacher framework with large-scale unlabeled data and aggressive data augmentation (spectral augmentation).

### Main problems

1. **Requirement for high-quality labeled data**: Traditional supervised learning needs a large amount of high-quality labeled data, which is costly and difficult to obtain.
2. **Negative impact of aggressive data augmentation**: In supervised training, aggressive augmentation such as spectral augmentation can degrade model performance. It may turn a positive sample into an effectively negative one while the hard label still marks it as positive, which increases the false accept rate.
3. **Utilization of unlabeled data**: A key question is how to use large-scale unlabeled data effectively to improve model performance.

### Solutions

The authors propose a two-stage self-training method:

1. **First stage**: Train a teacher model (the baseline model) with conventional supervised learning, using labeled data and classic data augmentation.
2. **Second stage**: Train the student model on soft labels generated by the teacher. In this stage, aggressive data augmentation (such as spectral augmentation) is applied to the inputs of both the teacher and the student, and unlabeled data is used for training.
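
The second stage can be illustrated with a minimal sketch, under several assumptions. The function names (`spec_augment`, `toy_teacher`, `make_student_targets`), the mask widths, and the zero-fill masking are hypothetical simplifications and are not taken from the paper. The sketch also assumes that the teacher and the student see the same augmented copy of each input. The point it illustrates is that the soft label is computed after augmentation, so it can reflect what the masking removed.

```python
import math
import random


def spec_augment(spectrogram, max_time_mask=2, max_freq_mask=2, rng=None):
    """Simplified spectral augmentation: zero one block of frames and one band of bins.

    `spectrogram` is a list of frames, each a list of frequency-bin values.
    The mask widths are illustrative defaults, not values from the paper.
    """
    rng = rng or random.Random()
    n_frames = len(spectrogram)
    n_bins = len(spectrogram[0])
    out = [row[:] for row in spectrogram]  # copy so the input is not modified

    # Time mask: zero a contiguous run of frames.
    t_width = rng.randint(0, min(max_time_mask, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [0.0] * n_bins

    # Frequency mask: zero a contiguous band of bins in every frame.
    f_width = rng.randint(0, min(max_freq_mask, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for row in out:
        for f in range(f_start, f_start + f_width):
            row[f] = 0.0
    return out


def toy_teacher(spectrogram):
    """Hypothetical stand-in for a trained teacher: returns a keyword probability."""
    total = sum(sum(row) for row in spectrogram)
    count = len(spectrogram) * len(spectrogram[0])
    return 1.0 / (1.0 + math.exp(-(total / count - 0.5)))


def make_student_targets(teacher_fn, unlabeled_batch, rng):
    """Pair each augmented, unlabeled input with the teacher's soft label for it."""
    pairs = []
    for x in unlabeled_batch:
        x_aug = spec_augment(x, rng=rng)
        soft_label = teacher_fn(x_aug)  # label is computed on the augmented input
        pairs.append((x_aug, soft_label))
    return pairs
```

A student training loop would then consume the `(x_aug, soft_label)` pairs in place of hard-labeled examples.
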
### Key innovation points

- **Aggressive data augmentation**: Because aggressive augmentation is applied to the inputs of both the teacher and the student, the teacher's soft labels adjust to the changes in the input. This avoids the inaccuracy of hard labels under aggressive augmentation.
- **Utilization of large-scale unlabeled data**: The self-training framework makes use of unlabeled data, which further improves the model's ability to generalize.
- **Improved self-training method**: Unlike conventional self-training, this method applies aggressive augmentation to both the teacher and the student in the same round. This suits binary classification tasks such as KWS, where the space of positive samples is relatively small.

### Experimental results

The experiments show that the proposed noisy student-teacher self-training method significantly improves accuracy on multiple challenging test sets, especially in far-field, accented, and noisy conditions. Accuracy on some difficult-condition test sets improves by as much as 60%.

### Formula summary

The loss function used in the paper is:

\[ \text{Student-Teacher Loss} = \alpha \times \text{Loss}_E + \text{Loss}_D \]

where:

- \(\text{Loss}_D\) is the cross-entropy loss on the decoder output:
  \[ \text{Loss}_D = \text{cross entropy}(y_T^d, y_S^d) \]
- \(\text{Loss}_E\) is the cross-entropy loss on the encoder output:
  \[ \text{Loss}_E = \text{cross entropy}(y_T^e, y_S^e) \]
- \(y_T = [y_T^d, y_T^e] = f_T(\text{augment}(x))\) and \(y_S = [y_S^d, y_S^e] = f_S(\text{augment}(x))\) are the outputs of the teacher and student models, respectively, on the augmented input.
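
As a rough illustration of the loss above, the following sketch computes it for a single example using plain Python lists of class probabilities. The function names, the default `alpha=1.0`, and the clipping constant `eps` are assumptions made for this example; the summary does not state the value of \(\alpha\) used in the paper.

```python
import math


def cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Cross entropy with the teacher's soft labels as the target distribution."""
    return -sum(
        t * math.log(max(s, eps))  # clip to avoid log(0)
        for t, s in zip(teacher_probs, student_probs)
    )


def student_teacher_loss(y_t_enc, y_s_enc, y_t_dec, y_s_dec, alpha=1.0):
    """alpha * Loss_E + Loss_D, following the formula summary."""
    loss_e = cross_entropy(y_t_enc, y_s_enc)  # encoder outputs
    loss_d = cross_entropy(y_t_dec, y_s_dec)  # decoder outputs
    return alpha * loss_e + loss_d
```

With soft targets, each cross-entropy term is smallest when the student's distribution matches the teacher's, so minimizing the loss pulls both the encoder and decoder outputs of the student toward those of the teacher.
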
Through these improvements, the paper addresses key challenges in keyword spotting, particularly robustness in complex acoustic environments.