Abstract: Automatic Speech Assessment (ASA) has seen notable advancements with the utilization of self-supervised learning (SSL) features in recent research. However, a key challenge in ASA lies in the imbalanced distribution of data, particularly evident in English test datasets. To address this challenge, we approach ASA as an ordinal classification task, introducing Weighted Vectors Ranking Similarity (W-RankSim) as a novel regularization technique. W-RankSim encourages closer proximity of weighted vectors in the output layer for similar classes, implying that feature vectors with similar labels would be gradually nudged closer to each other as they converge towards corresponding weighted vectors. Extensive experimental evaluations confirm the effectiveness of our approach in improving ordinal classification performance for ASA. Furthermore, we propose a hybrid model that combines SSL and handcrafted features, showcasing how the inclusion of handcrafted features enhances performance in an ASA system.

What problem does this paper attempt to address?

This paper addresses two main problems in Automatic Speech Assessment (ASA):

1. **Data imbalance**: In English test datasets, scores are usually approximately normally distributed, so the lowest and highest scores have very few data points. This imbalance hurts both model training and performance.
2. **Ordinal classification**: Traditional ASA methods usually treat the task as regression or classification and train with a mean-squared-error or cross-entropy loss. These losses ignore the ordinal nature of scores (i.e., the order relationship between them) and are easily affected by data imbalance.

To address these problems, the authors propose the following:

- **W-RankSim regularization**: A new regularization technique that captures the ordinal relationship between class labels by matching similarity rankings in the label space against similarity rankings in the weighted vector space. W-RankSim encourages the output-layer weighted vectors of similar classes to lie closer together, so that feature vectors with similar labels are gradually drawn closer to each other as they converge toward their corresponding weighted vectors. The loss is:

  \[ L_{\text{W-RankSim}} = \sum_{i=1}^{|C|} l\big(\text{rk}(S_c[i,:]),\ \text{rk}(S_w[i,:])\big) \]

  where \(S_c\) is the similarity matrix in the label space, \(S_w\) is the similarity matrix in the weighted vector space, \(\text{rk}\) is the ranking operation, and \(l\) is the ranking similarity function (the paper uses the mean squared error as \(l\)).
- **Hybrid model**: A model that combines self-supervised learning (SSL) features with handcrafted features. It has three main components: content, delivery, and language use. The reported experiments show that adding handcrafted features improves the performance of the ASA system.

Together, these methods target the data imbalance and ordinal classification problems in ASA, and the paper reports experiments supporting their effectiveness.
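
To make the W-RankSim formula concrete, the following is a minimal NumPy sketch that only evaluates the loss value for a given output-layer weight matrix. Several choices here are assumptions rather than details confirmed by the summary: \(S_c\) is taken as the negative absolute distance between class labels, \(S_w\) as the cosine similarity between weight vectors, and ties in a row are broken by column order. The names `rank_rows` and `w_ranksim_loss` are hypothetical. Because hard ranks are piecewise constant, this version provides no useful gradient; an actual training setup would need some differentiable ranking surrogate.

```python
import numpy as np


def rank_rows(matrix):
    """Rank the entries of each row (0 = smallest); ties follow column order."""
    order = np.argsort(matrix, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(matrix.shape[0])[:, None]
    # Scatter positions back: the j-th smallest entry of a row gets rank j.
    ranks[rows, order] = np.arange(matrix.shape[1])[None, :]
    return ranks


def w_ranksim_loss(weights, labels):
    """Sum over classes of the MSE between label-space and weight-space ranks.

    weights: (num_classes, dim) output-layer weight vectors, one per class.
    labels:  (num_classes,) ordinal score attached to each class.
    """
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # S_c (assumption): classes with closer labels count as more similar.
    s_c = -np.abs(labels[:, None] - labels[None, :])

    # S_w (assumption): cosine similarity between class weight vectors.
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    s_w = unit @ unit.T

    rk_c = rank_rows(s_c).astype(float)
    rk_w = rank_rows(s_w).astype(float)

    # l = mean squared error per row i, summed over the |C| classes.
    return float(np.sum(np.mean((rk_c - rk_w) ** 2, axis=1)))


# Toy example: four score classes with 2-D weight vectors on a quarter circle.
angles = np.deg2rad([0.0, 40.0, 70.0, 90.0])
ordered_weights = np.stack([np.cos(angles), np.sin(angles)], axis=1)
shuffled_weights = ordered_weights[[2, 0, 3, 1]]
scores = [0, 1, 2, 3]

print("ordered :", w_ranksim_loss(ordered_weights, scores))
print("shuffled:", w_ranksim_loss(shuffled_weights, scores))
```

In the toy example, the weight vectors whose geometric order follows the score order should give a lower loss than the shuffled arrangement, which is the behavior the regularizer is meant to reward.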

Optimizing Automatic Speech Assessment: W-RankSim Regularization and Hybrid Feature Fusion Strategies

An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition

Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge

Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

Towards Automatic Assessment of Self-Supervised Speech Models using Rank

Hybrid Approach to Automated Essay Scoring: Integrating Deep Learning Embeddings with Handcrafted Linguistic Features for Improved Accuracy

Automated Speech Scoring System Under The Lens: Evaluating and interpreting the linguistic cues for language proficiency

A Quantitative Approach to Understand Self-Supervised Models as Cross-lingual Feature Extractors

Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

Semi-supervised Learning For Robust Speech Evaluation

Preference-based training framework for automatic speech quality assessment using deep neural network

Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Aggregating Multiple Heuristic Signals As Supervision for Unsupervised Automated Essay Scoring

An ASR-free Fluency Scoring Approach with Self-Supervised Learning

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations