Abstract: With the recent development of speech-enabled interactive systems using artificial agents, there has been substantial interest in the analysis and classification of voice disorders to provide more inclusive systems for people living with specific speech and language impairments. In this paper, a two-stage framework is proposed to perform an accurate classification of diverse voice pathologies. The first stage consists of speech enhancement processing based on the original premise that an impaired voice can be treated as a noisy signal. To put this hypothesis into practice, the noise level [...] cepstral harmonic-to-noise ratio (CHNR). The second stage consists of a convolutional neural network with long short-term memory (CNN-LSTM) architecture designed to learn complex features from spectrograms of the first-stage enhanced signals. A new sinusoidal rectified unit (SinRU) is proposed as the activation function of the CNN-LSTM network. The experiments are carried out using two subsets of the Saarbruecken voice database (SVD) with different etiologies, covering eight pathologies. The first subset contains voice recordings of patients with vocal cordectomy, psychogenic dysphonia, pachydermia laryngis, and frontolateral partial laryngectomy, and the second subset contains voice recordings of patients with vocal fold polyp, chronic laryngitis, functional dysphonia, and vocal cord paresis. Identification of dysarthria severity levels in the Nemours and Torgo databases is also carried out. The experimental results showed that using the minimum mean square error (MMSE)-based signal enhancer prior to the CNN-LSTM network with SinRU led to a significant improvement in the automatic classification of the investigated voice disorders and dysarthria severity levels. These findings support the hypothesis that appropriate speech enhancement preprocessing has a positive effect on the accuracy of the automatic classification of voice pathologies, thanks to the reduction of the intrinsic noise induced by the voice impairment.
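
To make the second-stage input concrete, the following is a minimal sketch of how a log-magnitude spectrogram might be computed from a signal before it is passed to a network such as the CNN-LSTM. It is not the authors' implementation: the function name `log_spectrogram`, the frame length, the hop size, the Hann window, and the synthetic test tone are all assumptions chosen for illustration, and the first-stage MMSE enhancement is not reproduced here.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Return a (n_frames, frame_len // 2 + 1) log-magnitude spectrogram in dB.

    frame_len, hop, and the Hann window are illustrative assumptions,
    not settings taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop

    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # One-sided FFT magnitude per frame, converted to decibels.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitude + eps)


# Hypothetical stand-in for an enhanced recording: a 1 s, 1 kHz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

spec = log_spectrogram(tone)
# Frequency bin with the highest average level across frames.
peak_bin = int(spec.mean(axis=0).argmax())
```

With these assumed settings each frequency bin spans 62.5 Hz, so the 1 kHz tone should concentrate its energy in bin 16; a real recording would instead show the harmonic and noise structure that the classifier is meant to learn from.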
