Tran-DSR: A hybrid model for dysarthric speech recognition using transformer encoder and ensemble learning

Rabbia Mahum, Ahmed M. El-Sherbeeny, Khaled Alkhaledi, Haseeb Hassan
DOI: https://doi.org/10.1016/j.apacoust.2024.110019
IF: 3.614
2024-04-29
Applied Acoustics
Abstract: Over the last decade, the prevalence of neurological diseases has risen notably due to population growth and aging. Dysarthria commonly manifests in individuals with stroke, Parkinson's disease, cerebral palsy, and other neurological conditions. It can impair functional communication, often significantly reducing quality of life; ineffective communication has also been behind significant health and safety incidents and interferes with people's productivity. Timely detection and treatment of dysarthria in these patients is crucial for managing disease progression effectively. Failure to do so can complicate disease management and may harm the patient's psychological and physiological well-being as symptoms worsen. Many previous studies have addressed dysarthric speech detection by employing machine learning or deep learning techniques as classification tools. In this work, we propose a hybrid model, Tran-DSR, which combines the strengths of ensemble deep networks and the Transformer Encoder scheme. The ensemble serves as a backbone that extracts powerful features from mel-spectrograms. Two scenarios are considered: ensemble 1 (E1), which includes VGG16, DenseNet201, and GoogleNet, and ensemble 2 (E2), comprising InceptionResNetV2, DenseNet201, and Xception. The Transformer Encoder is built on self-attention, which allows the network to focus on relevant information, along with a multilayer perceptron for precise speech recognition. This hybrid approach enables accurate and efficient disease identification. Experimental results show that the Tran-DSR model achieves a highest accuracy of 99.18%, surpassing the performance of other research models.
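The pipeline the abstract describes (backbone features, then self-attention, then a classification head) can be illustrated with a small, dependency-free Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 4-dimensional feature vectors stand in for the outputs of the E1 backbones, attention is single-head with Q = K = V and no learned projections, mean pooling is used to merge the tokens, and a single linear layer with made-up weights stands in for the multilayer perceptron.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with Q = K = V = tokens.

    Simplification: a real Transformer encoder applies learned Q/K/V
    projections, multiple heads, residual connections, and normalization.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of all tokens, using the attention weights.
        outputs.append(
            [sum(row[j] * tokens[j][i] for j in range(len(tokens)))
             for i in range(dim)]
        )
    return outputs, weights


def linear_head(vector, weight, bias):
    """One linear layer plus softmax, standing in for the MLP classifier."""
    logits = [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)


# Hypothetical per-backbone feature vectors for one mel-spectrogram.
# In practice these would come from pretrained CNNs; the values are made up.
backbone_features = {
    "vgg16": [0.2, 0.1, 0.4, 0.3],
    "densenet201": [0.5, 0.2, 0.1, 0.2],
    "googlenet": [0.3, 0.3, 0.3, 0.1],
}

# Treat each backbone's output as one token in a length-3 sequence.
tokens = list(backbone_features.values())
attended, attention_weights = self_attention(tokens)

# Mean-pool the attended tokens into one vector (an assumed design choice).
pooled = [sum(column) / len(attended) for column in zip(*attended)]

# Made-up weights for two classes (e.g. dysarthric vs. control).
head_weight = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
head_bias = [0.0, 0.0]
class_probs = linear_head(pooled, head_weight, head_bias)
print(class_probs)
```

Treating each backbone as one token lets the attention weights express how much each network's features contribute to the fused representation; whether Tran-DSR tokenizes its features this way is not stated in the abstract.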
acoustics