Abstract: We develop two complementary advances for training no-reference (NR) speech quality estimators with independent datasets. Multi-dataset finetuning (MDF) pretrains an NR estimator on a single dataset and then finetunes it on multiple datasets at once, including the dataset used for pretraining. AlignNet uses an AudioNet to generate intermediate score estimates before using the Aligner to map intermediate estimates to the appropriate score range. AlignNet is agnostic to the choice of AudioNet so any successful NR speech quality estimator can benefit from its Aligner. The methods can be used in tandem, and we use two studies to show that they improve on current solutions: one study uses nine smaller datasets and the other uses four larger datasets. AlignNet with MDF improves on other solutions because it efficiently and effectively removes misalignments that impair the learning process, and thus enables successful training with larger amounts of more diverse data.

What problem does this paper attempt to address?

This paper addresses the score misalignment that arises when no-reference (NR) speech quality estimators are trained on multiple datasets. Specifically:

- **Background and Problem**: Scores from different listening experiments can differ even when the same speech files are rated, because of the "corpus effect." Pooling such datasets produces inconsistent training targets, which impairs training.
- **Proposed Solution**:
  - **Multi-Dataset Fine-tuning (MDF)**: Pre-train the model on one dataset, then fine-tune it on multiple datasets at once, including the pre-training dataset. This lets the model better handle score inconsistencies between datasets.
  - **AlignNet Architecture**: Composed of two parts, AudioNet and Aligner. AudioNet generates intermediate score estimates from audio or audio features, and the Aligner maps these intermediate estimates to the appropriate score range based on a dataset indicator. Because the Aligner is a small score alignment network, AlignNet lets different datasets work together to improve the training of the NR estimator.
- **Experimental Results**: The authors conducted experiments using 13 different datasets, covering 3 languages, 4 different speech attributes, and a wide range of measurement domains, with a total duration of over 300 hours. The results show that the combination of AlignNet and MDF outperforms existing methods across multiple datasets, resolving the score inconsistency between datasets and improving the model's generalization ability.

In summary, this paper aims to improve the training of NR speech quality estimators in a multi-dataset setting by proposing the MDF and AlignNet methods, thereby enhancing their performance in practical applications.
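The two ideas above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single scalar input feature, a linear stand-in for AudioNet, a per-dataset affine map as a stand-in for the Aligner, a reference dataset ("A") whose alignment is held at identity, scores on a normalized scale, and synthetic corpus effects. All names (`audionet`, `aligner`, `train`, `corpus`) are hypothetical.

```python
# Toy sketch of AlignNet + MDF under the assumptions stated above.

def audionet(x, p):
    """Stand-in AudioNet: one scalar feature -> intermediate score."""
    return p["w"] * x + p["b"]

def aligner(s, d, p):
    """Stand-in Aligner: affine map selected by the dataset indicator d."""
    gain, offset = p["align"][d]
    return gain * s + offset

def mse(data, p):
    """Mean squared error of aligned estimates against dataset scores."""
    return sum((aligner(audionet(x, p), d, p) - y) ** 2
               for d, x, y in data) / len(data)

def train(data, p, steps, lr, reference):
    """Full-batch gradient descent on MSE.

    The reference dataset's alignment is never updated, so it stays identity.
    """
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        grad_align = {d: [0.0, 0.0] for d in p["align"]}
        for d, x, y in data:
            s = audionet(x, p)
            gain = p["align"][d][0]
            dloss = 2.0 * (aligner(s, d, p) - y) / n  # d(MSE)/d(prediction)
            grad_w += dloss * gain * x
            grad_b += dloss * gain
            grad_align[d][0] += dloss * s
            grad_align[d][1] += dloss
        p["w"] -= lr * grad_w
        p["b"] -= lr * grad_b
        for d, (g_gain, g_offset) in grad_align.items():
            if d != reference:
                p["align"][d][0] -= lr * g_gain
                p["align"][d][1] -= lr * g_offset
    return p

# Synthetic data: a shared hidden quality, rescaled differently per dataset
# to mimic a corpus effect (the gains and offsets are made up).
xs = [i / 10 for i in range(-10, 11)]
corpus = {"A": (1.0, 0.0), "B": (0.5, 0.3), "C": (1.2, -0.2)}
data = [(d, x, g * (0.8 * x + 0.1) + o)
        for d, (g, o) in corpus.items() for x in xs]

params = {"w": 0.0, "b": 0.0, "align": {d: [1.0, 0.0] for d in corpus}}

# Step 1: pretrain on the single reference dataset.
train([row for row in data if row[0] == "A"], params,
      steps=300, lr=0.2, reference="A")
loss_before = mse(data, params)  # pooled loss with no alignment learned yet

# Step 2: MDF, finetuning on all datasets at once, including dataset A.
train(data, params, steps=3000, lr=0.2, reference="A")
loss_after = mse(data, params)

print("pooled MSE before MDF:", loss_before)
print("pooled MSE after MDF: ", loss_after)
print("learned alignments:   ", params["align"])
```

In this sketch the pooled loss cannot reach zero without the Aligner, because the three datasets assign different scores to the same input. Once the per-dataset maps are trained jointly with the stand-in AudioNet, the learned gains and offsets approach the synthetic corpus effects.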

AlignNet: Learning dataset score alignment functions to enable better training of speech quality estimators

AlignNet: A Unifying Approach to Audio-Visual Alignment

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Using RLHF to align speech enhancement approaches to mean-opinion quality scores

Audio Enhancement for Computer Audition—An Iterative Training Paradigm Using Sample Importance

One TTS Alignment To Rule Them All

MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning

Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models

A Neural Time Alignment Module for End-to-End Automatic Speech Recognition

Noise Contrastive Alignment of Language Models with Explicit Rewards

Align-ULCNet: Towards Low-Complexity and Robust Acoustic Echo and Noise Reduction

Neufa: neural network based end-to-end forced alignment with bidirectional attention mechanism

TIPAA-SSL: Text Independent Phone-to-Audio Alignment based on Self-Supervised Learning and Knowledge Transfer

What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning

Supervisory Data Alignment for Text-Independent Voice Conversion

The Development of the Cambridge University Alignment Systems for the Multi-Genre Broadcast Challenge.

SpeechAlign: Aligning Speech Generation to Human Preferences

FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data

APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations