Abstract: Machine Learning (ML) algorithms have demonstrated remarkable performance in dysphonia detection using speech samples. However, their efficacy often diminishes when they are tested on languages different from those in the training data, raising questions about their suitability in clinical settings. This study aims to develop a robust method for cross‐ and multi‐lingual dysphonia detection that overcomes the language dependency of existing ML methods. We propose an approach that leverages speech embeddings from speaker verification models, specifically ECAPA and x‐vector, and employs a majority voting ensemble classifier. Speech features extracted from the ECAPA and x‐vector embeddings are used to train three distinct classifiers. The main advantage of these embedding models is their ability to capture speaker characteristics in a language‐independent manner while producing fixed‐dimensional feature spaces. Additionally, we investigate the impact of generating synthetic data within the embedding feature space using the Synthetic Minority Oversampling Technique (SMOTE). Our experimental results demonstrate the effectiveness of the proposed method for dysphonia detection. Compared with x‐vector embeddings, ECAPA consistently performs better at distinguishing between healthy and dysphonic speech, achieving accuracies of 93.33% and 96.55% in the cross‐lingual and multi‐lingual scenarios, respectively. This highlights the ability of speaker verification models, especially ECAPA, to capture language‐independent features that enhance overall detection performance. The proposed method thus addresses the challenge of language dependency in dysphonia detection: ECAPA embeddings, combined with majority voting ensemble classifiers, show significant potential for improving the accuracy and reliability of dysphonia detection in cross‐ and multi‐lingual scenarios.
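
As a rough illustration of two ideas named in the abstract, the sketch below shows SMOTE-style interpolation in an embedding space and majority voting over three classifiers. It is not the authors' implementation: the 3-dimensional vectors stand in for real ECAPA or x-vector embeddings, the three label lists are hypothetical classifier outputs (the abstract does not name the classifiers), and the function names are invented for this example. A real pipeline would more likely rely on an existing SMOTE implementation and trained models.

```python
import math
import random
from collections import Counter


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic vectors by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Assumes `minority` holds at least two vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [v for v in minority if v is not base]
        neighbours = sorted(others, key=lambda v: math.dist(base, v))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def majority_vote(per_classifier_labels):
    """Combine label lists from several classifiers, one vote per classifier."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_classifier_labels)]


# Toy 3-dimensional stand-ins for speaker embeddings (assumption: real
# embeddings have far more dimensions).
dysphonic = [[0.9, 0.1, 0.4], [1.0, 0.2, 0.5], [0.8, 0.0, 0.3]]
augmented = dysphonic + smote_like_samples(dysphonic, n_new=4)

# Hypothetical outputs of three classifiers on four test utterances
# (1 = dysphonic, 0 = healthy).
votes = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
ensemble = majority_vote(votes)
print(ensemble)  # [1, 0, 1, 0]
```

With three voters and two classes a tie cannot occur, which is one practical reason to use an odd number of classifiers in a binary majority vote.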

Automatic cross‐ and multi‐lingual recognition of dysphonia by ensemble classification using deep speaker embedding models
