Automatic cross‐ and multi‐lingual recognition of dysphonia by ensemble classification using deep speaker embedding models

Dosti Aziz, Dávid Sztahó
DOI: https://doi.org/10.1111/exsy.13660
IF: 3.3
2024-06-14
Expert Systems
Abstract: Machine Learning (ML) algorithms have demonstrated strong performance in dysphonia detection from speech samples. However, their efficacy often diminishes when they are tested on languages different from those in the training data, raising questions about their suitability in clinical settings. This study aims to develop a robust method for cross‐ and multi‐lingual dysphonia detection that overcomes the language dependency of existing ML methods. We propose an approach that leverages speech embeddings from speaker verification models, specifically ECAPA and x‐vector, and employs a majority voting ensemble classifier. Features extracted from the ECAPA and x‐vector embeddings are used to train three distinct classifiers. The main advantage of these embedding models lies in their capability to capture speaker characteristics in a language‐independent manner, forming fixed‐dimensional feature spaces. Additionally, we investigate the impact of generating synthetic data within the embedding feature space using the Synthetic Minority Oversampling Technique (SMOTE). Our experimental results demonstrate the effectiveness of the proposed method for dysphonia detection. Compared to x‐vector embeddings, ECAPA consistently performs better at distinguishing between healthy and dysphonic speech, achieving accuracies of 93.33% in the cross‐lingual scenario and 96.55% in the multi‐lingual scenario. This highlights the ability of speaker verification models, especially ECAPA, to capture language‐independent features that enhance overall detection performance. The proposed method effectively addresses the challenge of language dependency in dysphonia detection. ECAPA embeddings, combined with majority voting ensemble classifiers, show significant potential for improving the accuracy and reliability of dysphonia detection in cross‐ and multi‐lingual scenarios.
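The two generic building blocks named in the abstract, SMOTE-style oversampling in a feature space and majority voting over several classifiers, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional vectors standing in for speaker embeddings, and the hard-coded classifier outputs are all assumptions made for illustration, and the abstract does not name the three classifiers used.

```python
import random
from collections import Counter


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Hypothetical helper; real embeddings would have many more dimensions."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def majority_vote(predictions):
    """Hard voting: predictions holds one list of labels per classifier;
    the most frequent label wins for each sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy minority-class "embeddings" (assumed values, not real data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
extra = smote_like_oversample(minority, n_new=4)

# Assumed outputs of three classifiers on three samples (0 = healthy, 1 = dysphonic).
ensemble = majority_vote([[0, 1, 1], [0, 0, 1], [1, 1, 1]])
print(len(extra), ensemble)
```

With three voters and two classes a tie cannot occur, which is one practical reason to use an odd number of classifiers in a hard-voting ensemble.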
computer science, artificial intelligence, theory & methods