AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators

Jaden Pieper, Stephen D. Voran
DOI: https://doi.org/10.21437/Interspeech.2024-74
2024-09-26
Abstract: We develop two complementary advances for training no-reference (NR) speech quality estimators with independent datasets. Multi-dataset finetuning (MDF) pretrains an NR estimator on a single dataset and then finetunes it on multiple datasets at once, including the dataset used for pretraining. AlignNet uses an AudioNet to generate intermediate score estimates before using the Aligner to map intermediate estimates to the appropriate score range. AlignNet is agnostic to the choice of AudioNet so any successful NR speech quality estimator can benefit from its Aligner. The methods can be used in tandem, and we use two studies to show that they improve on current solutions: one study uses nine smaller datasets and the other uses four larger datasets. AlignNet with MDF improves on other solutions because it efficiently and effectively removes misalignments that impair the learning process, and thus enables successful training with larger amounts of more diverse data.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the score misalignment problem that arises when no-reference (NR) speech quality estimators are trained on multiple datasets. Specifically:

- **Background and Problem**: Scores from different listening experiments can disagree even when the same speech files are rated, a discrepancy known as the "corpus effect." Combining such datasets yields inconsistent training targets, which impairs training.
- **Proposed Solution**:
  - **Multi-Dataset Fine-tuning (MDF)**: Pre-train the model on one dataset, then fine-tune it on multiple datasets at once, including the dataset used for pre-training. This lets the model better handle score inconsistencies between datasets.
  - **AlignNet Architecture**: Composed of two parts: AudioNet and Aligner. AudioNet generates intermediate score estimates from audio or audio features, and Aligner maps these intermediate estimates to the appropriate score range for each dataset, based on a dataset indicator. Because the Aligner is a small score alignment network, it resolves score inconsistencies at low cost and lets different datasets work together to improve the training of the NR estimator.
- **Experimental Results**: The authors conducted experiments using 13 different datasets, covering 3 languages, 4 different speech attributes, and a wide range of measurement domains, with a total duration of over 300 hours. The results show that the combination of AlignNet and MDF outperforms existing methods across multiple datasets, resolving the score inconsistencies between datasets and improving the model's generalization ability.

In summary, this paper proposes MDF and AlignNet to improve the training of NR speech quality estimators on multiple datasets, thereby enhancing their performance in practical applications.
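
The AudioNet/Aligner split and the two-stage MDF schedule described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the "AudioNet" is a scalar linear map on a single hypothetical feature, the "Aligner" is a per-dataset affine map with dataset 0 fixed to the identity as a reference, and the two datasets are synthetic (the second rates the same audio on a compressed scale to mimic a corpus effect). The paper's actual networks, loss, and training details may differ.

```python
# Toy sketch only: stand-ins for AudioNet and Aligner, trained with plain gradient descent.

def audio_net(x, p):
    """Stand-in AudioNet: intermediate score from one scalar feature (hypothetical)."""
    return p["w"] * x + p["c"]

def aligner(s, d, p):
    """Stand-in Aligner: dataset 0 is the reference (identity, an assumption);
    every other dataset d gets its own affine map of the intermediate score."""
    if d == 0:
        return s
    return p["a"][d] * s + p["b"][d]

def train(samples, p, steps, lr):
    """Full-batch gradient descent on mean squared error.
    Each sample is (feature, dataset_index, target_score)."""
    n = len(samples)
    for _ in range(steps):
        gw = gc = 0.0
        ga = {d: 0.0 for d in p["a"]}
        gb = {d: 0.0 for d in p["b"]}
        for x, d, y in samples:
            s = audio_net(x, p)
            r = aligner(s, d, p) - y                 # residual on the dataset's own scale
            slope = 1.0 if d == 0 else p["a"][d]     # d(aligned score) / d(intermediate score)
            gw += 2 * r * slope * x / n
            gc += 2 * r * slope / n
            if d != 0:
                ga[d] += 2 * r * s / n
                gb[d] += 2 * r / n
        p["w"] -= lr * gw
        p["c"] -= lr * gc
        for d in p["a"]:
            p["a"][d] -= lr * ga[d]
            p["b"][d] -= lr * gb[d]
    return p

# Synthetic data: the same five "files" rated in two hypothetical experiments.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ref = [(x, 0, x) for x in xs]                  # reference dataset
other = [(x, 1, 0.5 * x + 1.5) for x in xs]    # compressed rating scale (corpus effect)

params = {"w": 0.0, "c": 0.0, "a": {1: 1.0}, "b": {1: 0.0}}

# Stage 1 (pretraining): a single dataset.
train(ref, params, steps=5000, lr=0.02)
# Stage 2 (MDF): all datasets at once, including the pretraining dataset.
train(ref + other, params, steps=20000, lr=0.02)

print(params)
```

In this toy setting the shared AudioNet parameters stay close to the reference scale, while the Aligner parameters for dataset 1 should approach the scale and offset used to generate its scores. The point is only to show how a dataset indicator lets one shared estimator serve datasets whose scores are misaligned.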