SWIM: An Attention-Only Model for Speech Quality Assessment Under Subjective Variance

Imran E Kibria, Donald S. Williamson
2024-10-16
Abstract: Speech quality is best evaluated by human feedback using mean opinion scores (MOS). However, variance in ratings between listeners can introduce noise in the true quality label of an utterance. Currently, deep learning networks including convolutional, recurrent, and attention-based architectures have been explored for quality estimation. This paper proposes an exclusively attention-based model involving a Swin Transformer for MOS estimation (SWIM). Our network captures local and global dependencies that reflect the acoustic properties of an utterance. To counteract subjective variance in MOS labels, we propose a normal distance-based objective that accounts for standard deviation in each label, and we avail a multistage self-teaching strategy to improve generalization further. Our model is significantly more compact than existing attention-based networks for quality estimation. Finally, our experiments on the Samsung Open Mean Opinion Score (SOMOS) dataset show improvement over existing baseline models when trained from scratch.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper asks **how to estimate speech quality accurately when listeners disagree in their subjective ratings**. It predicts the Mean Opinion Score (MOS) directly from the speech signal with SWIM, a model built solely on the attention mechanism, and targets three challenges in existing methods:

1. **Differences in subjective scoring**: Listeners can rate the same utterance very differently, which adds noise to the label and degrades the model's estimate of the true quality.
2. **Data scarcity**: High-quality human-annotated data is costly and difficult to obtain, so training data is scarce.
3. **Imbalanced label distribution**: Medium-quality samples are common, while high-quality and low-quality samples are relatively few.

To address these problems, the paper proposes a new attention-based model, **SWIM (Swin Transformer for MOS Estimation)**, whose main features are:

- **Model architecture**: SWIM uses a Swin Transformer to capture local and global dependencies in the speech signal, so that the learned features better reflect the acoustic characteristics of the utterance.
- **Loss function**: To handle differences in subjective scoring, the paper proposes a loss function based on Mahalanobis distance. It scales the gap between the model prediction and the MOS label by the standard deviation of that label, so the same error is judged relative to how much the listeners disagreed.
- **Self-teaching strategy**: A multi-stage self-teaching strategy gradually reduces the noise caused by differences in subjective scoring and further improves generalization.

Experimental results show that SWIM outperforms existing baseline models on the Samsung Open Mean Opinion Score (SOMOS) dataset when trained from scratch, with better error and correlation metrics.
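To make the first challenge concrete, the following minimal sketch shows how two utterances can share the same MOS while their listeners disagree to very different degrees. The function name `mos_and_std`, the ratings, and the use of the population standard deviation are all assumptions for illustration; they are not taken from the paper or from SOMOS.

```python
from statistics import mean, pstdev


def mos_and_std(ratings):
    """Return (MOS, standard deviation) for one utterance's listener ratings.

    Assumption: the population standard deviation is used here; the source
    does not say which estimator the dataset or the authors use.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    return mean(ratings), pstdev(ratings)


# Hypothetical ratings on a 1-5 scale (made-up numbers, not from SOMOS).
consistent = [4, 4, 4, 4]   # listeners agree
divergent = [2, 4, 5, 5]    # listeners disagree, same mean

for name, ratings in [("consistent", consistent), ("divergent", divergent)]:
    mos, std = mos_and_std(ratings)
    print(f"{name}: MOS={mos:.2f}, std={std:.2f}")
```

Both utterances have a MOS of 4, but only the second label carries substantial listener disagreement, which is the information a plain MOS regression target discards.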
### Formula summary

- **Loss function**:
  \[ l = \log_{10}\left(1 + \frac{|y - \mu|}{\sigma + \epsilon}\right) \]
  where \(y\) is the model prediction, \(\mu\) is the Mean Opinion Score (MOS), \(\sigma\) is the standard deviation of the ratings, and \(\epsilon\) is a small constant (such as 0.01) that keeps the denominator from being zero.
- **Self-teaching label aggregation**:
  \[ y_t = (\alpha_0 \cdot \mu) + \sum_{i=1}^{m} (\alpha_i \cdot t_{i-1}) \]
  where \(\mu\) is the MOS label in the dataset, \(t_{i-1}\) is the model's prediction from a previous stage, and the \(\alpha_i\) are weight coefficients that sum to 1.

Through these innovations, SWIM can more accurately evaluate speech quality in the presence of differences in subjective scoring.
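The two formulas can be sketched per utterance in plain Python as follows. This is an illustrative reading of the equations, not the authors' implementation: the function names `swim_loss` and `aggregate_label` are hypothetical, the example numbers are made up, and how the loss is reduced over a batch is not specified in the summary.

```python
import math


def swim_loss(y, mu, sigma, eps=0.01):
    """Per-utterance loss: log10(1 + |y - mu| / (sigma + eps)).

    y: model prediction, mu: MOS label, sigma: standard deviation of the
    ratings, eps: small constant that avoids division by zero.
    """
    return math.log10(1.0 + abs(y - mu) / (sigma + eps))


def aggregate_label(mu, prev_preds, alphas):
    """Self-teaching target: y_t = alpha_0 * mu + sum_i alpha_i * t_{i-1}.

    prev_preds holds the predictions t_0 .. t_{m-1} from previous stages;
    alphas holds alpha_0 .. alpha_m and must sum to 1.
    """
    if len(alphas) != len(prev_preds) + 1:
        raise ValueError("need one weight for mu plus one per previous prediction")
    if not math.isclose(sum(alphas), 1.0):
        raise ValueError("weights must sum to 1")
    return alphas[0] * mu + sum(a * t for a, t in zip(alphas[1:], prev_preds))


# Same absolute error (0.5), different listener agreement (made-up values).
print(round(swim_loss(3.5, 3.0, 0.49), 3))  # ratio 1  -> log10(2)  ≈ 0.301
print(round(swim_loss(3.5, 3.0, 0.04), 3))  # ratio 10 -> log10(11) ≈ 1.041

# One previous stage, equal weights on the MOS label and the old prediction.
print(aggregate_label(4.0, [3.0], [0.5, 0.5]))  # 3.5
```

The first two calls show the intended effect of the denominator: the same prediction error costs more when listeners largely agreed on the label (small \(\sigma\)) than when they disagreed (large \(\sigma\)).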