Abstract:

Objectives: The study aims to classify normal and pathological voices by using the wav2vec 2.0 model as a feature extractor in conjunction with machine learning classifiers.

Methods: Voice recordings were sourced from the publicly accessible VOICED database. The data were preprocessed, including normalization and data augmentation, before being passed to the wav2vec 2.0 model for feature extraction. The extracted features were then used to train four machine learning models: Support Vector Machine (SVM), K-Nearest Neighbors, Decision Tree (DT), and Random Forest (RF). The models were evaluated using Stratified K-Fold cross-validation. Model performance was assessed using accuracy, precision, recall, F1-score, macro and micro averages, receiver operating characteristic (ROC) curves, and confusion matrices.

Results: The RF model achieved the highest accuracy (0.98 ± 0.02), alongside strong recall (0.97 ± 0.04) and F1-score (0.95 ± 0.05), and consistently high area under the curve (AUC) values approaching 1.00, indicating superior classification performance. The DT model also performed very well, particularly in precision (0.97 ± 0.02) and F1-score (0.96 ± 0.02), with AUC values ranging from 0.86 to 1.00. Macro-averaged and micro-averaged analyses showed that the DT model provided the most balanced and consistent performance across all classes, while the RF model performed robustly across multiple metrics. Data augmentation significantly enhanced the performance of all models, with marked improvements in accuracy, recall, F1-score, and AUC values, most notably for the RF and DT models. ROC curve analysis further confirmed the consistency and reliability of the RF and SVM models across folds, and confusion matrix analysis showed that the RF and SVM models made the fewest misclassifications when distinguishing "Normal" from "Pathological" samples. Overall, the RF and DT models emerged as the most robust performers, making them particularly well suited to the voice classification task in this study.

Conclusions: Combining wav2vec 2.0 feature extraction with machine learning models proved highly effective in classifying normal and pathological voices, achieving exceptional accuracy and robustness across various evaluation metrics.
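The evaluation protocol named in the Methods (stratified K-fold splitting and macro versus micro averaging) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `stratified_folds` and `macro_micro_f1`, the round-robin split without shuffling, and the toy labels are all assumptions made for illustration, and no wav2vec 2.0 features or VOICED data are involved.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps roughly
    the same class proportions as the full label list.
    (Hypothetical helper: round-robin per class, no shuffling.)"""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the samples of each class across the folds in turn.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    # Macro: average the per-class F1 scores (each class counts equally).
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts over classes first (each sample counts equally).
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


if __name__ == "__main__":
    # Toy, imbalanced label set (illustrative only).
    labels = ["Normal"] * 4 + ["Pathological"] * 8
    for i, fold in enumerate(stratified_folds(labels, k=4)):
        print(i, [labels(j) if False else labels[j] for j in fold])

    y_true = ["Normal", "Normal"] + ["Pathological"] * 4
    y_pred = ["Normal"] + ["Pathological"] * 5
    print(macro_micro_f1(y_true, y_pred))
```

With imbalanced classes, the macro average is pulled down by errors on the smaller class, while the micro average tracks overall per-sample accuracy, which is why reporting both gives a fuller picture than accuracy alone.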

Voice Disorder Classification Using Wav2vec 2.0 Feature Extraction
