Abstract: Automated speaking assessment (ASA) typically involves automatic speech recognition (ASR) and hand-crafted feature extraction from the ASR transcript of a learner's speech. Recently, self-supervised learning (SSL) has shown stellar performance compared to traditional methods. However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss reweighting, leveraging distinct SSL-based embedding features. Extensive experimental results on the ICNALE benchmark dataset suggest that our approach can outperform existing strong baselines by a sizable margin, achieving a significant improvement of more than 10% in CEFR prediction accuracy.

What problem does this paper attempt to address?

This paper addresses three main data-related challenges in Automated Speaking Assessment (ASA):

1. **Limited labeled data**: Existing ASA systems lack sufficient labeled data, which restricts model training and generalization.
2. **Uneven distribution of learner proficiency**: Learners are very unevenly distributed across CEFR (Common European Framework of Reference for Languages) levels, so models perform poorly on rare categories.
3. **Uneven scoring intervals between different CEFR levels**: For example, the gap between B2 and B1 is not equal to the gap between B1 and A2. This non-uniformity is difficult for traditional regression methods to handle.

To address these challenges, the authors propose two novel modeling strategies:

- **Metric-based classification**: Prototypical Networks with different similarity functions (such as cosine similarity and squared Euclidean distance) alleviate the data imbalance problem and handle the non-uniform scoring intervals between CEFR levels.
- **Loss re-weighting**: The loss function is re-weighted according to the frequency distribution of CEFR levels and its reciprocal, increasing the model's attention to rare categories.

The experimental results show that these strategies significantly improve CEFR prediction accuracy on the ICNALE benchmark dataset, by more than 10% over existing strong baseline models. Specifically, the best configuration, W2V-PT(SED)+LW, raises accuracy from 77.88% to 92.63%, a gain of 14.75 percentage points.
In addition, the paper also explores the impact of different initialization methods on model performance, and further verifies the effectiveness and robustness of the proposed methods through the analysis of confusion matrices, the classification performance of learners with different native languages, and embedding visualization.
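
The two strategies summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it follows the standard Prototypical Network formulation (class prototype = mean embedding, softmax over negative distance) and assumes a simple inverse-frequency weighting, since the summary does not give the exact weighting formula. The 2-D vectors are toy stand-ins for SSL-based embeddings, and all function names are hypothetical.

```python
import math
from collections import Counter

def squared_euclidean(u, v):
    """Squared Euclidean distance (SED) between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def class_prototypes(embeddings, labels):
    """Prototype of each class = mean of that class's embeddings."""
    counts = Counter(labels)
    sums = {}
    for x, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def class_probabilities(x, prototypes, metric="sed"):
    """Softmax over negative SED, or over cosine similarity."""
    if metric == "sed":
        logits = {y: -squared_euclidean(x, p) for y, p in prototypes.items()}
    else:
        logits = {y: cosine_similarity(x, p) for y, p in prototypes.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {y: math.exp(z - peak) for y, z in logits.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def inverse_frequency_weights(labels):
    """Assumed weighting: n / (k * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}

def weighted_nll(probs, true_label, weights):
    """Negative log-likelihood scaled by the true class's weight."""
    return -weights[true_label] * math.log(probs[true_label])

# Toy, imbalanced training set: B1 is frequent, B2 is rare.
embeddings = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0],
              [1.2, 1.0], [1.0, 1.2], [3.0, 3.0]]
labels = ["A2", "A2", "B1", "B1", "B1", "B2"]

prototypes = class_prototypes(embeddings, labels)
weights = inverse_frequency_weights(labels)

query = [2.8, 3.1]
probs = class_probabilities(query, prototypes, metric="sed")
predicted = max(probs, key=probs.get)
loss = weighted_nll(probs, "B2", weights)
print(predicted, weights)
```

Because the prediction depends only on distances to class prototypes, the classes are treated as categories rather than as equally spaced points on a numeric scale, which is one way to read the claim that metric-based classification sidesteps non-uniform score intervals. The class weights scale each example's loss so that errors on rare levels cost more during training.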

An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution
