Abstract

Purpose: Many studies using machine learning (ML) in speech, language, and hearing sciences rely upon cross-validations with single data splitting. This study's first purpose is to provide quantitative evidence that would incentivize researchers to instead use the more robust data splitting method of nested k-fold cross-validation. The second purpose is to present methods and MATLAB code to perform power analysis for ML-based analysis during the design of a study.

Method: First, the significant impact of different cross-validations on ML outcomes was demonstrated using real-world clinical data. Then, Monte Carlo simulations were used to quantify the interactions among the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, the dimensionality of the model, and the sample size. Four different cross-validation methods (single holdout, 10-fold, train–validation–test, and nested 10-fold) were compared based on the statistical power and confidence of the resulting ML models. Distributions of the null and alternative hypotheses were used to determine the minimum required sample size for obtaining a statistically significant outcome (5% significance) with 80% power. Statistical confidence of the model was defined as the probability of correct features being selected for inclusion in the final model.

Results: ML models generated based on the single holdout method had very low statistical power and confidence, leading to overestimation of classification accuracy. Conversely, the nested 10-fold cross-validation method resulted in the highest statistical confidence and power while also providing an unbiased estimate of accuracy. The required sample size using the single holdout method could be 50% higher than what would be needed if nested k-fold cross-validation were used. Statistical confidence in the model based on nested k-fold cross-validation was as much as four times higher than the confidence obtained with the single holdout-based model. A computational model, MATLAB code, and lookup tables are provided to assist researchers with estimating the minimum sample size needed during study design.

Conclusion: The adoption of nested k-fold cross-validation is critical for unbiased and robust ML studies in the speech, language, and hearing sciences.

Supplemental Material: https://doi.org/10.23641/asha.25237045
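The two ideas in the abstract, nested k-fold splitting and a power-based minimum sample size, can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code. The function names (`kfold_indices`, `nested_kfold`, `binom_sf`, `min_sample_size`) are hypothetical. The power calculation is also a deliberate simplification: it assumes a classifier with a fixed true accuracy evaluated on n independent test samples against a 50% chance level, whereas the study's Monte Carlo simulations also model training, feature selection, and the cross-validation scheme.

```python
import random
from math import comb


def kfold_indices(indices, k, rng):
    """Shuffle `indices` and deal them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_kfold(n_samples, k_outer=10, k_inner=10, seed=0):
    """Yield (outer_test, inner_splits) for nested k-fold cross-validation.

    inner_splits holds (inner_train, inner_val) pairs drawn only from the
    outer training portion, so the outer test fold never takes part in
    feature selection or hyperparameter tuning.
    """
    rng = random.Random(seed)
    outer_folds = kfold_indices(range(n_samples), k_outer, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [s for j, f in enumerate(outer_folds) if j != i for s in f]
        inner_folds = kfold_indices(outer_train, k_inner, rng)
        inner_splits = []
        for m, inner_val in enumerate(inner_folds):
            inner_train = [s for j, f in enumerate(inner_folds) if j != m for s in f]
            inner_splits.append((inner_train, inner_val))
        yield outer_test, inner_splits


def binom_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_sample_size(true_acc, chance=0.5, alpha=0.05, power=0.80, n_max=500):
    """Smallest n at which a one-sided exact binomial test reaches `power`.

    Simplifying assumption (not from the paper): accuracy on n test samples
    is Binomial(n, chance) under the null and Binomial(n, true_acc) under
    the alternative.
    """
    for n in range(1, n_max + 1):
        # Smallest number of correct predictions that is significant at alpha.
        k_crit = next(k for k in range(n + 2) if binom_sf(k, n, chance) <= alpha)
        if binom_sf(k_crit, n, true_acc) >= power:
            return n
    return None  # target power not reached within n_max
```

Iterating over `nested_kfold(n_samples=50, k_outer=5, k_inner=4)` gives five outer test folds of 10 samples each, each paired with four inner train/validation splits over the remaining 40 samples. Calling `min_sample_size(0.7)` returns the smallest test-set size under the simplified binomial model; a model selected with a leakier validation scheme would need its own null distribution, which is the effect the paper quantifies.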

Developing Machine Learning Models When Data is Limiting

On How Data Are Partitioned in Model Development and Evaluation: Confronting the Elephant in the Room to Enhance Model Generalization.

When not to use machine learning: A perspective on potential and limitations

Frequent Errors in Modeling by Machine Learning: A Prototype Case of Predicting the Timely Evolution of COVID-19 Pandemic

Robust Machine Learning by Transforming and Augmenting Imperfect Training Data

Interpretable and predictive models based on high-dimensional data in ecology and evolution

Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Synthetic data at scale: a development model to efficiently leverage machine learning in agriculture

Machine Learning in Environmental Research: Common Pitfalls and Best Practices.

Quality of Data in Machine Learning

A novel algorithm can generate data to train machine learning models in conditions of extreme scarcity of real world data

Clinical prediction models and the multiverse of madness

Strategies for overcoming data scarcity, imbalance, and feature selection challenges in machine learning models for predictive maintenance

Developing a Dataset-Adaptive, Normalized Metric for Machine Learning Model Assessment: Integrating Size, Complexity, and Class Imbalance

Learning from Limited and Imperfect Data

Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Estimating Sample Size and Reducing Overfitting

Machine learning models and over-fitting considerations

The challenges of using machine learning models in psychiatric research and clinical practice

Scaling Laws for the Value of Individual Data Points in Machine Learning

Why Machine Learning Models Systematically Underestimate Extreme Values

The Data Addition Dilemma