Abstract: Speech quality is best evaluated by human feedback using mean opinion scores (MOS). However, variance in ratings between listeners can introduce noise in the true quality label of an utterance. Deep learning networks including convolutional, recurrent, and attention-based architectures have been explored for quality estimation. This paper proposes an exclusively attention-based model involving a Swin Transformer for MOS estimation (SWIM). Our network captures local and global dependencies that reflect the acoustic properties of an utterance. To counteract subjective variance in MOS labels, we propose a normal distance-based objective that accounts for the standard deviation of each label, and we use a multistage self-teaching strategy to further improve generalization. Our model is significantly more compact than existing attention-based networks for quality estimation. Finally, our experiments on the Samsung Open Mean Opinion Score (SOMOS) dataset show improvement over existing baseline models when trained from scratch.

What problem does this paper attempt to address?

The main problem this paper addresses is **how to evaluate speech quality accurately when subjective ratings disagree**. Specifically, the paper aims to predict the Mean Opinion Score (MOS) directly from the speech signal with a model (SWIM) built solely on attention mechanisms, in order to overcome several challenges in existing methods:

1. **Differences in subjective scoring**: Different listeners may rate the same utterance very differently, which introduces noise and affects the model's estimate of the true quality label.
2. **Data scarcity**: High-quality human-annotated data is costly and difficult to obtain, so training data is scarce.
3. **Imbalanced label distribution**: There are relatively few high-quality and low-quality speech samples, and many more medium-quality samples.

To address these problems, the paper proposes a new attention-only model, **SWIM (Swin Transformer for MOS Estimation)**, whose main features are:

- **Model architecture**: SWIM uses a Swin Transformer to capture local and global dependencies in the speech signal, so that it better reflects the acoustic characteristics of the utterance.
- **Loss function**: To deal with differences in subjective scoring, the paper proposes a loss function based on Mahalanobis distance (the abstract calls it a normal distance-based objective). The loss takes the standard deviation of each label into account, so the gap between the prediction and the label is measured relative to how much the listeners disagreed.
- **Self-teaching strategy**: Through a multi-stage self-teaching strategy, the model gradually reduces the noise caused by differences in subjective scoring and further improves its generalization.

Experimental results show that SWIM outperforms existing baseline models on the Samsung Open Mean Opinion Score (SOMOS) dataset when trained from scratch, with better error and correlation metrics.
### Formula summary

- **Loss function**:

  \[ l = \log_{10}\left(1 + \frac{|y - \mu|}{\sigma + \epsilon}\right) \]

  where \(y\) is the model prediction, \(\mu\) is the Mean Opinion Score (MOS), \(\sigma\) is the standard deviation of the scores, and \(\epsilon\) is a small constant (such as 0.01) that keeps the denominator from being zero.

- **Self-teaching label aggregation**:

  \[ y_t = (\alpha_0 \cdot \mu) + \sum_{i = 1}^{m} (\alpha_i \cdot t_{i - 1}) \]

  where \(\mu\) is the MOS label in the dataset, \(t_{i - 1}\) is the prediction of the model from a previous stage, and \(\alpha_i\) are weight coefficients chosen so that all weights sum to 1.

Through these innovations, SWIM can evaluate speech quality more accurately in the presence of differences in subjective scoring.
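The two formulas above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic for a single utterance, not the authors' implementation: the function names `normal_distance_loss` and `aggregate_labels`, the default `eps=0.01`, and the sample numbers are all assumptions made for the example.

```python
import math


def normal_distance_loss(pred, mos, std, eps=0.01):
    """Loss from the summary: l = log10(1 + |y - mu| / (sigma + eps)).

    pred is the model prediction y, mos is the label mu, std is the
    per-label standard deviation sigma. eps=0.01 is an assumed default.
    """
    return math.log10(1.0 + abs(pred - mos) / (std + eps))


def aggregate_labels(mos, teacher_preds, weights):
    """Target from the summary: y_t = alpha_0 * mu + sum_i alpha_i * t_{i-1}.

    weights[0] is alpha_0 (applied to the dataset MOS); weights[i] for
    i >= 1 is applied to teacher_preds[i - 1], the earlier-stage prediction.
    """
    if len(weights) != len(teacher_preds) + 1:
        raise ValueError("need one weight for the MOS plus one per teacher")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    target = weights[0] * mos
    for alpha, teacher in zip(weights[1:], teacher_preds):
        target += alpha * teacher
    return target


if __name__ == "__main__":
    # Same absolute error (0.5), different listener agreement.
    tight = normal_distance_loss(pred=3.5, mos=3.0, std=0.2)
    wide = normal_distance_loss(pred=3.5, mos=3.0, std=1.0)
    print(f"loss with low rater variance:  {tight:.3f}")
    print(f"loss with high rater variance: {wide:.3f}")

    # Hypothetical two earlier stages blended with the dataset label.
    blended = aggregate_labels(3.0, [3.4, 3.2], [0.5, 0.3, 0.2])
    print(f"aggregated training target:    {blended:.2f}")
```

With the same absolute error, the label whose raters agreed more closely (smaller standard deviation) yields the larger loss, which is the behaviour the summary attributes to the objective: errors on noisy labels are penalised less.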

SWIM: An Attention-Only Model for Speech Quality Assessment Under Subjective Variance
