Abstract: With the recent development of speech-enabled interactive systems using artificial agents, there has been substantial interest in the analysis and classification of voice disorders to provide more inclusive systems for people living with specific speech and language impairments. In this paper, a two-stage framework is proposed to perform an accurate classification of diverse voice pathologies. The first stage consists of speech enhancement processing based on the original premise that impaired voice can be treated as a noisy signal. To put this hypothesis into practice, the noise level [...] cepstral harmonic-to-noise ratio (CHNR). The second stage consists of a convolutional neural network with long short-term memory (CNN-LSTM) architecture designed to learn complex features from spectrograms of the first-stage enhanced signals. A new sinusoidal rectified unit (SinRU) is proposed as the activation function of the CNN-LSTM network. The experiments are carried out on two subsets of the Saarbruecken voice database (SVD) with different etiologies, covering eight pathologies. The first subset contains voice recordings of patients with vocal cordectomy, psychogenic dysphonia, pachydermia laryngis, and frontolateral partial laryngectomy, and the second subset contains voice recordings of patients with vocal fold polyp, chronic laryngitis, functional dysphonia, and vocal cord paresis. Identification of dysarthria severity levels on the Nemours and Torgo databases is also carried out. The experimental results show that using the minimum mean square error (MMSE)-based signal enhancer prior to the CNN-LSTM network with SinRU leads to a significant improvement in the automatic classification of the investigated voice disorders and dysarthria severity levels.
These findings support the hypothesis that appropriate speech enhancement preprocessing improves the accuracy of automatic voice pathology classification by reducing the intrinsic noise induced by the voice impairment.
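
The second stage described above operates on spectrograms of the enhanced signals. As a minimal sketch of what such an input could look like, the snippet below computes a log-magnitude spectrogram in pure Python. The frame length, hop size, Hann window, and the synthetic test tone are illustrative assumptions, not values taken from the paper; the MMSE enhancement stage, the CNN-LSTM network, and SinRU are not reproduced here.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-10):
    """Return a list of frames; each frame holds log-magnitudes (dB)
    for frequency bins 0..frame_len // 2.

    frame_len and hop are assumed values chosen to keep the demo small.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Naive DFT for clarity; an FFT routine would be used in practice.
            acc = sum(
                chunk[i] * cmath.exp(-2j * math.pi * k * i / frame_len)
                for i in range(frame_len)
            )
            row.append(20.0 * math.log10(abs(acc) + eps))
        frames.append(row)
    return frames


# Hypothetical test signal: a 1 kHz sine sampled at 8 kHz, 256 samples long.
fs = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / fs) for t in range(256)]
spec = log_spectrogram(tone)

# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

With these assumed settings each bin spans 125 Hz, so the 1 kHz tone peaks in bin 8. A real pipeline would apply the same transform to the enhanced recording and pass the resulting time-frequency matrix to the network.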

An Attention Long Short-Term Memory based system for automatic classification of speech intelligibility

On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification

Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network

Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users using Intermediate ASR Features and Human Memory Models

Towards Efficient Recurrent Architectures: A Deep LSTM Neural Network Applied to Speech Enhancement and Recognition

A deep learning approach to dysarthric utterance classification with BiLSTM-GRU, speech cue filtering, and log mel spectrograms

Multiple-target Deep Learning for LSTM-RNN Based Speech Enhancement

Prediction of speech intelligibility with DNN-based performance measures

Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata

MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids

MTI-Net: A Multi-Target Speech Intelligibility Prediction Model

An Attention-Based Neural Network Approach For Single Channel Speech Enhancement

Exploration of Audio Quality Assessment and Anomaly Localisation Using Attention Models

Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners

Voice disorder classification using speech enhancement and deep learning models

Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification

Speech Emotion Classification Using Attention-Based LSTM

A Novel LSTM-Based Speech Preprocessor for Speaker Diarization in Realistic Mismatch Conditions

Deep neural network architectures for dysarthric speech analysis and recognition

Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion