Pathological voice detection using optimized deep residual neural network and explainable artificial intelligence

Roohum Jegan, R. Jayagowri
DOI: https://doi.org/10.1007/s11042-024-20348-y
IF: 2.577
2024-10-09
Multimedia Tools and Applications
Abstract: Voice disorders impair individuals' vocal quality and communication abilities, posing significant challenges for both individuals and healthcare providers. Accurate and timely detection of voice disorders is crucial for early intervention and effective treatment. This study proposes a new noninvasive approach for voice disorder detection based on an optimized deep residual neural network. Input speech samples are transformed into mel-spectrogram time-frequency images, which are used to train a ResNet-50 transfer learning model. The spectrogram time-frequency representation captures intricate patterns, at both local and global scales, that may indicate the presence of voice disorders. Four hyperparameters of the ResNet-50 model are optimized using the snake optimization algorithm, yielding an optimized residual deep transfer learning (DTL) model with an enhanced voice pathology detection rate. The proposed snake-optimized ResNet-50 model is evaluated on four popular voice pathology datasets: AVPD, SVD, PdA and VOICED. The results demonstrate the efficacy of the optimized ResNet-50 framework in classifying healthy and pathological voice samples, with 98.13% accuracy. Comparisons with recent machine learning and deep learning models show that the proposed approach performs better in terms of F1-score, sensitivity, specificity and accuracy. Finally, gradient-weighted class activation mapping (Grad-CAM), an explainable artificial intelligence (XAI) technique, is used to visualize and interpret the decision-making process.
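The abstract's first processing step, turning a speech sample into a mel-spectrogram image for an image classifier, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The frame length, hop size, number of mel bands, the HTK-style mel formula, the dB scaling, and the replication of one grayscale channel into three are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale (an assumed choice; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(signal, sr, n_fft=512, hop=128, n_mels=64):
    """Return a log-mel spectrogram in dB with shape (n_mels, n_frames)."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < n_fft:
        signal = np.pad(signal, (0, n_fft - signal.size))
    n_frames = 1 + (signal.size - n_fft) // hop
    window = np.hanning(n_fft)  # symmetric Hann window
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    mel_power = power @ mel_filterbank(sr, n_fft, n_mels).T
    log_mel = 10.0 * np.log10(np.maximum(mel_power, 1e-10))  # floor avoids log(0)
    return log_mel.T


def to_image(log_mel):
    """Min-max scale to 0..255 and copy the single channel into three channels."""
    lo, hi = log_mel.min(), log_mel.max()
    scaled = (log_mel - lo) / (hi - lo) if hi > lo else np.zeros_like(log_mel)
    gray = np.round(255 * scaled).astype(np.uint8)
    return np.stack([gray, gray, gray], axis=-1)
```

Calling `to_image(log_mel_spectrogram(samples, sr))` on a mono waveform gives a `uint8` array of shape `(n_mels, n_frames, 3)`. Resizing that array to the input size expected by a pretrained network, and the fine-tuning, hyperparameter search and Grad-CAM steps described in the abstract, are outside the scope of this sketch.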
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering