Abstract:
Objectives: This study aims to classify normal and pathological voices by using the wav2vec 2.0 model as a feature extractor in conjunction with machine learning classifiers.
Methods: Voice recordings were obtained from the publicly accessible VOICED database. The data were preprocessed, including normalization and data augmentation, and then passed to the wav2vec 2.0 model for feature extraction. The extracted features were used to train four machine learning models: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), and Random Forest (RF). The models were evaluated with Stratified K-Fold cross-validation using accuracy, precision, recall, F1-score, macro and micro averages, receiver-operating characteristic (ROC) curves, and confusion matrices.
Results: The RF model achieved the highest accuracy (0.98 ± 0.02), alongside strong recall (0.97 ± 0.04), F1-score (0.95 ± 0.05), and consistently high area under the curve (AUC) values approaching 1.00, indicating superior classification performance. The DT model also performed very well, particularly in precision (0.97 ± 0.02) and F1-score (0.96 ± 0.02), with AUC values ranging from 0.86 to 1.00. Macro-averaged and micro-averaged analyses showed that the DT model provided the most balanced and consistent performance across all classes, while the RF model performed robustly across multiple metrics. Data augmentation significantly improved the performance of all models, with marked gains in accuracy, recall, F1-score, and AUC values, most notably for the RF and DT models. ROC curve analysis further confirmed the consistency and reliability of the RF and SVM models across folds, and confusion matrix analysis showed that the RF and SVM models made the fewest misclassifications when distinguishing "Normal" from "Pathological" samples. Overall, the RF and DT models emerged as the most robust performers, making them particularly well suited to the voice classification task in this study.
Conclusions: Combining wav2vec 2.0 feature extraction with machine learning classifiers proved highly effective in classifying normal and pathological voices, achieving high accuracy (up to 0.98 ± 0.02 with RF) and robustness across the evaluation metrics.
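
The evaluation protocol described in the Methods (Stratified K-Fold cross-validation, macro- and micro-averaged scores, results reported as mean ± standard deviation over folds) can be illustrated with a minimal pure-Python sketch. This is not the authors' code. The features are synthetic one-dimensional stand-ins for wav2vec 2.0 embeddings, the nearest-centroid classifier is a placeholder for the SVM, KNN, DT, and RF models, and the fold count (k = 5), the 40/60 class split, and all function names are assumptions made only for illustration.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)  # deal each class round-robin over the folds
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro-F1, micro-F1) for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    macro = mean(f1(**counts[c]) for c in classes)  # average of per-class F1
    total = {key: sum(counts[c][key] for c in classes) for key in ("tp", "fp", "fn")}
    micro = f1(**total)  # F1 of the pooled counts
    return macro, micro


def nearest_centroid_predict(x_train, y_train, x_test):
    """Placeholder classifier (assumption): assign each sample to the closest class mean."""
    groups = defaultdict(list)
    for value, label in zip(x_train, y_train):
        groups[label].append(value)
    centroids = {c: mean(vals) for c, vals in groups.items()}
    return [min(centroids, key=lambda c: abs(v - centroids[c])) for v in x_test]


# Synthetic stand-in data (assumption): one scalar "feature" per recording.
rng = random.Random(42)
y = ["Normal"] * 40 + ["Pathological"] * 60
x = [rng.gauss(0.0 if label == "Normal" else 2.0, 1.0) for label in y]

macro_scores, micro_scores = [], []
for train, test in stratified_k_fold(y, k=5, seed=42):
    pred = nearest_centroid_predict(
        [x[i] for i in train], [y[i] for i in train], [x[i] for i in test]
    )
    macro, micro = macro_micro_f1([y[i] for i in test], pred)
    macro_scores.append(macro)
    micro_scores.append(micro)

# Report each score as mean ± standard deviation over the folds.
print(f"macro-F1: {mean(macro_scores):.2f} ± {stdev(macro_scores):.2f}")
print(f"micro-F1: {mean(micro_scores):.2f} ± {stdev(micro_scores):.2f}")
```

For single-label classification, micro-averaged F1 equals overall accuracy, while macro-averaged F1 weights every class equally regardless of its size. For example, with true labels N, N, N, P and predictions N, N, P, P, the micro-F1 is 0.75 and the macro-F1 is about 0.73, which is why the two averages can rank models differently on imbalanced data.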

Application of Hierarchical Clustering Analysis for Vocal Feature Extraction

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Data-Driven Decision-Support System for Speaker Identification Using E-Vector System

Auditory model-based speech feature extraction and its application to speaker identification

Multi-resolution Time Frequency Feature and Complementary Combination for Short Utterance Speaker Recognition

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Audio–Visual Deep Clustering for Speech Separation

Speaker Segmentation and Clustering Based on the Improved Spectral Clustering

A Novel Information Integration Algorithm for Speech Recognition System: Basing on Adaptive Clustering and Supervised State of Acoustic Feature

Identification of Speaker from Disguised Voice Using MFCC Feature Extraction, Chi-Square and Classification Technique

Hierarchical Support Vector Machines for Audio Classification

Voice Disorder Classification Using Wav2vec 2.0 Feature Extraction

Maximum Margin Clustering Based Statistical VAD with Multiple Observation Compound Feature

A Novel I-Vector Framework Using Multiple Features and PCA for Speaker Recognition in Short Speech Condition

MFCC in audio signal processing for voice disorder: a review

Time–Frequency Cepstral Features and Heteroscedastic Linear Discriminant Analysis for Language Recognition

Variant Time-Frequency Cepstral Features for Speaker Recognition

Analysis of Multimedia Feature Extraction Technology in College Vocal Performance Teaching Mode Based on Multimodal Multimedia Information

Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering

Deep neural networks based speaker modeling at different levels of phonetic granularity