Abstract:

Objectives: This study aims to classify normal and pathological voices by using the wav2vec 2.0 model as a feature extractor in conjunction with machine learning classifiers.

Methods: Voice recordings were sourced from the publicly accessible VOICED database. The data were preprocessed with normalization and data augmentation and then passed to the wav2vec 2.0 model for feature extraction. The extracted features were used to train four machine learning models: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), and Random Forest (RF). The models were evaluated using Stratified K-Fold cross-validation. Performance was assessed with accuracy, precision, recall, F1-score, macro and micro averages, the receiver-operating characteristic (ROC) curve, and the confusion matrix.

Results: The RF model achieved the highest accuracy (0.98 ± 0.02), alongside strong recall (0.97 ± 0.04) and F1-score (0.95 ± 0.05), and area under the curve (AUC) values consistently approaching 1.00, indicating superior classification performance. The DT model also performed very well, particularly in precision (0.97 ± 0.02) and F1-score (0.96 ± 0.02), with AUC values ranging from 0.86 to 1.00. Macro-averaged and micro-averaged analyses showed that the DT model provided the most balanced and consistent performance across all classes, while the RF model performed robustly across multiple metrics. Data augmentation significantly enhanced the performance of all models, with marked improvements in accuracy, recall, F1-score, and AUC, most notably for the RF and DT models. ROC curve analysis further confirmed the consistency and reliability of the RF and SVM models across folds, and confusion matrix analysis showed that the RF and SVM models had the fewest misclassifications when distinguishing "Normal" from "Pathological" samples. Consequently, the RF and DT models emerged as the most robust performers, making them particularly well suited to the voice classification task in this study.

Conclusions: Combining wav2vec 2.0 with machine learning models proved highly effective in classifying normal and pathological voices, achieving exceptional accuracy and robustness across the evaluation metrics.
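
The abstract reports both macro-averaged and micro-averaged scores without spelling out how the two differ. The sketch below is one way to illustrate the distinction for a two-class problem. The confusion-matrix counts are hypothetical values chosen for illustration, not results from the study, and the helper names (per_class_counts, precision_recall_f1, macro_average, micro_average) are assumptions of this example rather than the authors' code.

```python
from typing import Dict, List, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def per_class_counts(confusion: List[List[int]], labels: List[str]) -> Dict[str, Counts]:
    """One-vs-rest counts; confusion[i][j] = samples of true class i predicted as class j."""
    counts: Dict[str, Counts] = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # rest of the true-class row
        fp = sum(row[k] for row in confusion) - tp  # rest of the predicted-class column
        counts[label] = (tp, fp, fn)
    return counts


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(counts: Dict[str, Counts]) -> Tuple[float, ...]:
    """Score each class separately, then take the unweighted mean over classes."""
    scores = [precision_recall_f1(*c) for c in counts.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


def micro_average(counts: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Pool the counts of all classes first, then score once."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall_f1(tp, fp, fn)


# Hypothetical, deliberately imbalanced confusion matrix (NOT data from the study).
labels = ["Normal", "Pathological"]
confusion = [
    [40, 10],   # true Normal:       40 correct, 10 predicted Pathological
    [5, 145],   # true Pathological:  5 predicted Normal, 145 correct
]

counts = per_class_counts(confusion, labels)
macro = macro_average(counts)   # (precision, recall, F1)
micro = micro_average(counts)   # (precision, recall, F1)

print("macro P/R/F1:", [round(v, 4) for v in macro])
print("micro P/R/F1:", [round(v, 4) for v in micro])
```

In a single-label setting such as this one, micro-averaged precision, recall, and F1 all equal overall accuracy, so the micro average is dominated by the larger class. The macro average weights each class equally and therefore drops when the minority class is classified less well, which is why reporting both is informative on an imbalanced dataset.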
