Abstract: The potential of deep learning in clinical speech processing is immense, yet limited and imbalanced clinical data samples remain a major hurdle. This article addresses these challenges by using automatic speech recognition and self-supervised learning representations, pre-trained on extensive datasets of normal speech, to estimate the voice quality of patients with impaired vocal systems. Experiments are conducted on the PVQD dataset, which covers various causes of vocal system damage in English, and on a Japanese dataset of patients with Parkinson's disease recorded before and after subthalamic nucleus deep brain stimulation (STN-DBS) surgery. The results on PVQD show a strong correlation (PCC > 0.8) and a low error (MSE < 0.5) in predicting the Grade, Breathy, and Asthenic indicators. Progress has also been made in predicting the voice quality of patients in the context of STN-DBS.
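For readers unfamiliar with the two metrics quoted in the abstract, the sketch below shows how a Pearson correlation coefficient (PCC) and a mean squared error (MSE) are computed between clinician ratings and model predictions. The function names and the four example scores are made up for illustration and are not taken from the paper.

```python
import math

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Centre both sequences, then divide the covariance term by the
    # product of the two standard-deviation terms.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical perceptual ratings and model predictions (illustrative only).
ratings = [0.0, 1.0, 2.0, 3.0]
predictions = [0.5, 1.0, 1.5, 3.0]

print(f"PCC = {pearson_cc(ratings, predictions):.3f}")
print(f"MSE = {mse(ratings, predictions):.3f}")
```

A higher PCC means the predictions track the ordering and spread of the ratings more closely, while a lower MSE means the predicted values sit closer to the ratings in absolute terms, so the two numbers are complementary.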

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to evaluate the voice quality of patients with impaired vocal systems. Specifically, the research uses automatic speech recognition (ASR) representations together with other features to estimate patients' voice quality, in order to overcome the limitations of traditional auditory-perceptual judgment.

### Main problems and challenges

1. **Limited and unbalanced clinical data samples**: Deep-learning models require a large amount of data to extract robust features, but clinical voice data are often limited and unbalanced.
2. **Limitations of subjective evaluation**: Traditional auditory-perceptual evaluation depends on experienced doctors or speech pathologists to evaluate sustained vowels and continuous speech. This method has the following problems:
   - It requires evaluators with extensive clinical experience.
   - To improve the reliability of the evaluation, multiple evaluators are usually required.
   - The evaluation cycle is long, which delays the results doctors need.
3. **Lack of objective evaluation means**: Existing objective evaluation methods mainly focus on sustained vowels, and the amount of such vowel data is limited, which is not sufficient to train deep-learning models effectively.

### Solutions

To solve the above problems, this research proposes a new method that evaluates the voice quality of patients by combining the following features:

- **Automatic speech recognition (ASR) representation**: Use a pre-trained ASR model (such as Whisper) to extract voice features.
- **Self-supervised learning (SSL) representation**: Use self-supervised pre-trained models such as HuBERT to extract voice features.
- **Mel-spectrogram**: Capture the frequency characteristics of audio signals.
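The summary does not say how the three feature streams are combined, so the following is only a minimal sketch of one common option: mean-pool each frame-level feature sequence over time and concatenate the results into a single utterance-level vector. The helper names, the feature dimensions, and the random stand-in frames are all hypothetical; a real pipeline would obtain the frames from a pre-trained ASR encoder, an SSL model, and a mel-spectrogram front end.

```python
import random

def mean_pool(frames):
    """Average a list of frame vectors (time x dim) over the time axis."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]

def fuse(asr_frames, ssl_frames, mel_frames):
    """Concatenate the time-averaged vectors of the three feature streams.

    Assumption: simple mean pooling plus concatenation; the paper may
    combine the features differently.
    """
    return mean_pool(asr_frames) + mean_pool(ssl_frames) + mean_pool(mel_frames)

def fake_frames(n_frames, dim, rng):
    """Random stand-in for real model outputs (illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_frames)]

rng = random.Random(0)
# Hypothetical sizes: the streams may differ in frame rate and dimension.
asr_frames = fake_frames(50, 8, rng)    # stand-in for ASR encoder output
ssl_frames = fake_frames(50, 6, rng)    # stand-in for SSL model output
mel_frames = fake_frames(100, 4, rng)   # stand-in for mel-spectrogram frames

fused = fuse(asr_frames, ssl_frames, mel_frames)
print(len(fused))  # one vector whose length is the sum of the three dims
```

Pooling over time before concatenation sidesteps the mismatch in frame rates between streams; a downstream regressor could then map the fused vector to scores such as Grade, Breathy, and Asthenic.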
This method handles not only sustained vowels but also continuous speech, which improves the accuracy and robustness of the evaluation. The experimental results show that, when predicting indicators such as Grade, Breathy, and Asthenic, the method achieves a strong correlation and a low error (PCC > 0.8, MSE < 0.5). The research also explores the changes in the voice quality of patients with Parkinson's disease before and after subthalamic nucleus deep brain stimulation (STN-DBS) surgery, further testing the effectiveness of the proposed method.

### Conclusion

This research provides a more objective and efficient method for clinical voice quality evaluation by introducing ASR, SSL, and Mel-spectrogram features. The experimental results show that the method performs better than traditional methods on multiple indicators and can be applied to actual clinical scenarios. In particular, it shows preliminary predictive ability in evaluating the postoperative voice quality of patients with Parkinson's disease.

Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features
