Abstract: Speech evaluation measures a learner's oral proficiency using automatic models. Corpora for training such models often pose sparsity challenges: scored data from teachers is often limited, and the score distribution across proficiency levels is often imbalanced among student cohorts. Automatic scoring is thus not robust when faced with under-represented or out-of-distribution samples, which inevitably exist in real-world deployment scenarios. This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization to approximate subjective evaluation criteria. In particular, normalized mutual information is used to quantify the speech characteristics from the learner and the reference. An anchor model is trained using pseudo labels to predict the correctness of pronunciation. An interpolated loss function is proposed to minimize not only the prediction error with respect to ground-truth scores but also the divergence between two probability distributions estimated by the speech evaluation model and the anchor model. Compared to other state-of-the-art methods on a public dataset, this approach not only achieves high performance when evaluating the test set as a whole, but also yields the most evenly distributed prediction error across distinct proficiency levels. Furthermore, empirical results show that the model's accuracy on out-of-distribution data also compares favorably with competitive baselines.

What problem does this paper attempt to address?

This paper addresses the robustness of automatic speech evaluation models when training data is scarce and imbalanced. Specifically, training an automatic model to evaluate learners' spoken language proficiency usually faces the following challenges:

1. **Data Sparsity**: The datasets required for training such models are often sparse; that is, scored data (especially scores from teachers) is limited.
2. **Uneven Data Distribution**: The score distribution is often imbalanced across student groups, with fewer samples at certain proficiency levels.
3. **Poor Generalization Ability**: For the above reasons, automatic scoring performs poorly on under-represented or out-of-distribution (OOD) samples, so the model lacks robustness and fairness in actual deployment.

To solve these problems, the paper proposes a two-stage training method that combines semi-supervised pre-training and objective regularization to improve the model's robustness to data imbalance and insufficient samples. Specific measures include:

- **Introducing Normalized Mutual Information (NMI) as a Metric**: Mutual information is used to quantify the similarity between the learner's speech features and the reference speech.
- **Semi-supervised Pre-training with Pseudo-labels**: Pseudo-labels are generated automatically to increase the diversity of the training data and to prevent the model from over-fitting to a small amount of labeled data.
- **Interpolated Loss Function**: During training, the interpolated loss function minimizes both the prediction error and the divergence between the two probability distributions, thereby improving the model's prediction accuracy across proficiency levels.
With these improvements, the model not only performs well on the overall test set, but also yields a more uniform distribution of prediction errors across proficiency levels and performs well on OOD data.

### Formula Summary

- **Mutual Information Formula**:
  \[ I(T, \hat{T})=\sum_{i = 1}^{C}\sum_{j = 1}^{C}p(w_i, w_j)\log\frac{p(w_i, w_j)}{p(w_i)p(w_j)} \]
  where \(w_i\) is the \(i\)-th phoneme and \(C\) is the cardinality of the phoneme set.
- **Normalized Mutual Information Formula**:
  \[ NI(T, \hat{T})=\frac{2\times I(T, \hat{T})}{H(T)+H(\hat{T})} \]
  where \(H(T)\) and \(H(\hat{T})\) are the entropies (uncertainties) of variables \(T\) and \(\hat{T}\), respectively.
- **Cross-Entropy Loss Formula**:
  \[ D =-\frac{1}{N}\sum_{t = 1}^{N}\sum_{s = 1}^{S}\tilde{p}(y = s|O_t, T_t)\log[p(y = s|O_t, T_t)] \]
- **Kullback-Leibler Divergence (KLD) Formula**:
  \[ D_{KL}(p\|q)=\frac{1}{N}\sum_{t = 1}^{N}p\cdot\log\left(\frac{p}{q}\right) \]
- **Interpolated Mean Squared Error (iMSE) Loss Formula**:
  \[ D =\frac{1-\rho}{N}\sum_{t = 1}^{N}(y - s_t)^2+\frac{\rho}{N}\sum_{t = 1}^{N}\left[\exp\left(-\frac{(\hat{s}_t - s_t)^2}{2}\right)\cdot(y - \hat{s}_t)^2\right] \]

Through these methods, the paper aims to improve the robustness and generalization ability of automatic speech evaluation models, making them more reliable and fair in practical applications.
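To make the normalized mutual information formula concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the reference phoneme sequence \(T\) and the recognized sequence \(\hat{T}\) have already been aligned to equal length, and it estimates all probabilities from co-occurrence counts at aligned positions. The function names `entropy` and `normalized_mutual_information` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(ref, hyp):
    """NI(T, T_hat) = 2 * I(T, T_hat) / (H(T) + H(T_hat)).

    Assumption: `ref` and `hyp` are phoneme label sequences that were
    aligned beforehand, so position k in one corresponds to position k
    in the other.
    """
    if not ref or len(ref) != len(hyp):
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(ref)
    joint = Counter(zip(ref, hyp))   # counts for p(w_i, w_j)
    ref_counts = Counter(ref)        # counts for p(w_i)
    hyp_counts = Counter(hyp)        # counts for p(w_j)

    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        p_a = ref_counts[a] / n
        p_b = hyp_counts[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))

    denom = entropy(ref) + entropy(hyp)
    # If both sequences are constant, the ratio is undefined; returning 0.0
    # is a convention chosen for this sketch only.
    return 0.0 if denom == 0.0 else 2.0 * mi / denom


if __name__ == "__main__":
    reference = ["k", "ae", "t", "k", "ae", "t"]
    recognized = ["k", "ae", "t", "k", "eh", "t"]
    print(normalized_mutual_information(reference, recognized))
```

Because the score depends only on how consistently labels co-occur, a sequence that matches the reference exactly scores 1, while statistically independent sequences score 0.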
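The interpolated MSE loss can be illustrated with a small sketch under stated assumptions; it is not the authors' code. Here \(y\) is read as the evaluation model's predicted score for utterance \(t\), \(s_t\) as the ground-truth score, \(\hat{s}_t\) as the score derived from the anchor model, and \(\rho\) as the interpolation weight. Both terms are treated as non-negative penalties to be minimized, which is an assumption about the intended sign convention. The function name `interpolated_mse` and the default `rho=0.5` are hypothetical.

```python
import math


def interpolated_mse(preds, targets, anchor_scores, rho=0.5):
    """Interpolate plain MSE with an anchor-guided, confidence-weighted MSE.

    Assumptions (not confirmed by the source): `preds` are model outputs,
    `targets` are ground-truth scores s_t, and `anchor_scores` are the
    anchor model's scores s_hat_t for the same utterances.
    """
    if not preds or not (len(preds) == len(targets) == len(anchor_scores)):
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    n = len(preds)

    # First term: prediction error with respect to the ground-truth scores.
    supervised = sum((y - s) ** 2 for y, s in zip(preds, targets)) / n

    # Second term: distance to the anchor score, down-weighted by
    # exp(-(s_hat - s)^2 / 2) when the anchor disagrees with the ground truth.
    anchored = sum(
        math.exp(-((a - s) ** 2) / 2.0) * (y - a) ** 2
        for y, s, a in zip(preds, targets, anchor_scores)
    ) / n

    return (1.0 - rho) * supervised + rho * anchored


if __name__ == "__main__":
    print(interpolated_mse([3.5, 7.0], [4.0, 8.0], [4.0, 6.0], rho=0.3))
```

With `rho=0` the loss reduces to ordinary mean squared error. As `rho` grows, the anchor term gains influence, but only for utterances where the anchor score is close to the ground truth, since the exponential weight decays quickly as the two diverge.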

Semi-supervised Learning For Robust Speech Evaluation
