Abstract: Machine learning (ML) algorithms have demonstrated remarkable performance in dysphonia detection using speech samples. However, their efficacy often diminishes when they are tested on languages different from those in the training data, raising questions about their suitability in clinical settings. This study aims to develop a robust method for cross‐ and multi‐lingual dysphonia detection that overcomes the language dependency of existing ML methods. We propose an approach that leverages speech embeddings from speaker verification models, specifically ECAPA and x‐vector, and employs a majority voting ensemble classifier. Speech features extracted from the ECAPA and x‐vector embeddings are used to train three distinct classifiers. The main advantage of these embedding models is their capability to capture speaker characteristics in a language‐independent manner, forming fixed‐dimensional feature spaces. Additionally, we investigate the impact of generating synthetic data within the embedding feature space using the Synthetic Minority Oversampling Technique (SMOTE). Our experimental results demonstrate the effectiveness of the proposed method for dysphonia detection. Compared with x‐vector embeddings, ECAPA consistently performs better in distinguishing between healthy and dysphonic speech, achieving accuracies of 93.33% and 96.55% in the cross‐lingual and multi‐lingual scenarios, respectively. This highlights the capability of speaker verification models, especially ECAPA, to capture language‐independent features that enhance overall detection performance. The proposed method thus addresses the challenge of language dependency in dysphonia detection: ECAPA embeddings, combined with a majority voting ensemble classifier, show significant potential for improving the accuracy and reliability of dysphonia detection in cross‐ and multi‐lingual scenarios.
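
The two generic building blocks named in the abstract, hard majority voting over several classifiers and SMOTE-style oversampling in a fixed-dimensional feature space, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the three classifiers or the SMOTE settings, so the function names `majority_vote` and `smote_like`, the parameter `k`, and the toy two-dimensional vectors standing in for ECAPA or x‐vector embeddings are all assumptions made for illustration.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Hard voting: one label per sample, chosen by the most classifiers.

    `predictions` is a list with one inner list of labels per classifier.
    With three classifiers and two classes a tie cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def smote_like(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling (simplified, hypothetical helper).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Assumes `minority` holds at least two feature vectors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy usage: three classifiers' predictions (0 = healthy, 1 = dysphonic).
votes = [[0, 1, 1], [0, 0, 1], [1, 1, 1]]
combined = majority_vote(votes)

# Toy 2-D vectors standing in for minority-class embeddings.
minority_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = smote_like(minority_embeddings, n_new=5)
```

In practice the embeddings would have far more dimensions, and a library implementation of SMOTE and of the voting ensemble would normally replace these hand-written helpers; the sketch only shows the logic of each step.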

Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings

Fusing linguistic and acoustic information for automated forensic speaker comparison

Transferring Audio Deepfake Detection Capability Across Languages

Automatic cross‐ and multi‐lingual recognition of dysphonia by ensemble classification using deep speaker embedding models

Embedding Aggregation for Forensic Facial Comparison

Impact of Naturalistic Field Acoustic Environments on Forensic Text-independent Speaker Verification System

Validation of an ECAPA-TDNN system for Forensic Automatic Speaker Recognition under case work conditions

Cross-lingual Speaker Verification with Deep Feature Learning

Investigating cross-lingual training for offensive language detection

Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition

Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency

Investigating the Impact of Cross-lingual Acoustic-Phonetic Similarities on Multilingual Speech Recognition

Low-resource Accent Classification in Geographically-proximate Settings: A Forensic and Sociophonetics Perspective

Deep Learning for Speaker Identification: Architectural Insights from AB-1 Corpus Analysis and Performance Evaluation

Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information

Triplet loss based embeddings for forensic speaker identification in Spanish

Introducing Multilingual Phonetic Information to Speaker Embedding for Speaker Verification

Comparative Analysis of Multilingual Text Classification & Identification through Deep Learning and Embedding Visualization

Improved disentangled speech representations using contrastive learning in factorized hierarchical variational autoencoder

Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling