Abstract: Machine Learning (ML) algorithms have demonstrated remarkable performance in dysphonia detection using speech samples. However, their efficacy often diminishes when they are tested on languages different from those in the training data, raising questions about their suitability in clinical settings. This study aims to develop a robust method for cross‐ and multi‐lingual dysphonia detection that overcomes the language dependency of existing ML methods. We propose an approach that leverages speech embeddings from speaker verification models, specifically ECAPA and x‐vector, and employs a majority voting ensemble classifier. We use speech features extracted from the ECAPA and x‐vector embeddings to train three distinct classifiers. The key advantage of these embedding models is their ability to capture speaker characteristics in a language‐independent manner while forming fixed‐dimensional feature spaces. Additionally, we investigate the impact of generating synthetic data within the embedding feature space using the Synthetic Minority Oversampling Technique (SMOTE). Our experimental results demonstrate the effectiveness of the proposed method for dysphonia detection. Compared with x‐vector embeddings, ECAPA consistently performs better at distinguishing between healthy and dysphonic speech, achieving accuracies of 93.33% in the cross‐lingual scenario and 96.55% in the multi‐lingual scenario. This highlights the ability of speaker verification models, especially ECAPA, to capture language‐independent features that enhance overall detection performance. The proposed method effectively addresses the challenge of language dependency in dysphonia detection. ECAPA embeddings, combined with majority voting ensemble classifiers, show significant potential for improving the accuracy and reliability of dysphonia detection in cross‐ and multi‐lingual scenarios.
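The two generic techniques named in the abstract, SMOTE-style oversampling in an embedding space and majority voting over three classifiers, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the classifiers, the SMOTE settings, or the embedding dimensionality, so the function names `smote_like` and `majority_vote`, the neighbour count `k`, and the toy 2-dimensional "embeddings" are all assumptions made for illustration. A real pipeline would more likely rely on an established SMOTE implementation and on actual ECAPA or x‐vector embeddings.

```python
import random
from collections import Counter


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Hypothetical helper; requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.
    predictions: one list of labels per classifier, all of equal length."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy stand-ins for fixed-dimensional embeddings of the minority class.
minority_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
augmented = smote_like(minority_embeddings, n_new=5)

# Hypothetical outputs of three classifiers on three test utterances
# (0 = healthy, 1 = dysphonic).
votes = [[0, 1, 1], [0, 0, 1], [1, 1, 1]]
final = majority_vote(votes)
```

With three voters and two classes a tie cannot occur, which is one practical reason to use an odd number of classifiers in a binary majority vote.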

Singer Identification Using Convolutional Acoustic Motif Embeddings

An Attentional Neural Network Architecture for Folk Song Classification

A Deep Learning Based Analysis-Synthesis Framework For Unison Singing

Singer Identification for Metaverse with Timbral and Middle-Level Perceptual Features

Computational lexical analysis of Flamenco genres

Singer Identity Representation Learning using Self-Supervised Techniques

A Novel Framework for Efficient Automated Singer Identification in Large Music Databases

Deep Conditional Representation Learning for Drum Sample Retrieval by Vocalisation

Singer identification model using data augmentation and enhanced feature conversion with hybrid feature vector and machine learning

From Real to Cloned Singer Identification

Decoding Vocal Articulations from Acoustic Latent Representations

Boosting the Predictive Accurary of Singer Identification Using Discrete Wavelet Transform For Feature Extraction

Are Music Foundation Models Better at Singing Voice Deepfake Detection? Far-Better Fuse them with Speech Foundation Models

Use of speaker recognition approaches for learning and evaluating embedding representations of musical instrument sounds

Polyphonic pitch detection with convolutional recurrent neural networks

Automatic cross‐ and multi‐lingual recognition of dysphonia by ensemble classification using deep speaker embedding models

Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network

Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting

Contextual Joint Factor Acoustic Embeddings

Triplet loss based embeddings for forensic speaker identification in Spanish

Improved harmonic spectral envelope extraction for singer classification with hybridised model