Abstract

Background: Auditory-perceptual assessment of voice is a subjective procedure. Artificial intelligence with deep learning (DL) may improve the consistency and accessibility of this task. It is unclear how a DL model performs on different acoustic features.

Aims: To develop a generalizable DL framework for identifying dysphonia using a multidimensional acoustic feature.

Methods & procedures: Recordings of sustained phonations of /a/ and /i/ were retrospectively collected from a clinical database. Participants comprised 238 dysphonic and 223 vocally healthy speakers of Mandarin Chinese. All audio clips were split into multiple 1.5-s segments and normalized to the same loudness level. Mel-frequency cepstral coefficients (MFCCs) and mel-spectrograms were extracted from these standardized segments. Each feature set was fed to a convolutional neural network (CNN) to perform a binary classification task. The best feature was selected through five-fold cross-validation on a randomly selected 80% of the data. The resultant DL framework was tested on the remaining 20% of the data and on a public German voice database. Its performance was compared with that of two baseline machine-learning models.

Outcomes & results: The mel-spectrogram yielded the best model performance, with a mean area under the receiver operating characteristic curve of 0.972 and an accuracy of 92% in classifying audio segments. The resultant DL framework significantly outperformed both baseline models in detecting dysphonic subjects on both test sets. The best outcomes were achieved when classifications were based on all segments of both vowels, with 95% accuracy, 92% recall, 98% precision and 98% specificity on the Chinese test set, and 92%, 95%, 90% and 89%, respectively, on the German set.

Conclusions & implications: This study demonstrates the feasibility of DL for automatic detection of dysphonia. The mel-spectrogram is a preferred acoustic feature for the task. This framework may be used for vocal health screening and may facilitate automatic perceptual evaluation of voice in the era of big data.

What this paper adds

What is already known on this subject: Auditory-perceptual assessment is the current gold standard in clinical evaluation of voice quality, but its value may be limited by rater reliability and accessibility. DL is a new method of artificial intelligence that can overcome these disadvantages and promote automatic voice assessment. This study explored the feasibility of a DL approach for automatic detection of dysphonia, along with a quantitative comparison of two common sets of acoustic features.

What this study adds to existing knowledge: A CNN model is excellent at decoding multidimensional acoustic features, outperforming the baseline parameter-based models in identifying dysphonic voices. The first 13 MFCCs are sufficient for this task. The mel-spectrogram yields better performance than the MFCCs, indicating that it presents the acoustic features to the CNN model in a more favourable way.

What are the potential or actual clinical implications of this work? DL is a feasible method for the detection of dysphonia. The current DL framework may be used for remote vocal health screening or for documenting voice recovery after treatment. In future, DL models may be used to perform auditory-perceptual tasks in an automatic, efficient, reliable and low-cost manner.
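The segmentation, level normalization and subject-level decision steps summarized in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say whether segments overlap, how loudness was normalized, or how segment outputs were combined, so the sketch assumes non-overlapping segments, simple RMS scaling as a stand-in for loudness normalization, and averaging of segment probabilities with a 0.5 threshold. All function names and the `target_rms` value are hypothetical.

```python
import math


def split_segments(samples, sample_rate, seg_seconds=1.5):
    """Cut a recording into non-overlapping fixed-length segments.

    Assumption: any trailing remainder shorter than one segment is dropped.
    """
    seg_len = int(round(seg_seconds * sample_rate))
    n_full = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]


def rms_normalize(segment, target_rms=0.1):
    """Scale a segment to a common RMS level.

    Assumption: RMS scaling stands in for the loudness normalization
    mentioned in the abstract; the actual method is not specified there.
    """
    rms = math.sqrt(sum(x * x for x in segment) / len(segment))
    if rms == 0.0:
        return list(segment)  # silent segment: nothing to scale
    gain = target_rms / rms
    return [x * gain for x in segment]


def subject_decision(segment_probs, threshold=0.5):
    """Combine per-segment dysphonia probabilities into one subject label.

    Assumption: the mean probability over all segments is thresholded.
    Returns 1 (dysphonic) or 0 (vocally healthy).
    """
    mean_prob = sum(segment_probs) / len(segment_probs)
    return 1 if mean_prob >= threshold else 0


def binary_metrics(y_true, y_pred):
    """Accuracy, recall, precision and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
    }
```

In a full pipeline, each normalized segment would be converted to a mel-spectrogram or MFCC matrix and passed to the CNN; the probabilities it returns for one speaker's /a/ and /i/ segments would then go to `subject_decision`, and `binary_metrics` would be computed over the resulting subject labels.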

Deep learning in automatic detection of dysphonia: Comparing acoustic features and developing a generalizable framework

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

Speech Intelligibility Classifiers from 550k Disordered Speech Samples

Artificial Intelligence‐Powered Acoustic Analysis System for Dysarthria Severity Assessment

Deep Learning and Artificial Intelligence Applied to Model Speech and Language in Parkinson's Disease

Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation

Advancing Audio Emotion and Intent Recognition with Large Pre-Trained Models and Bayesian Inference

Feasibility Study of Parkinson's Speech Disorder Evaluation With Pre-Trained Deep Learning Model for Speech-to-Text Analysis