Evaluating the Impact of Data Augmentation on Predictive Model Performance

Valdemar Švábenský, Conrad Borchers, Elizabeth B. Cloude, Atsushi Shimada
DOI: https://doi.org/10.1145/3706468.3706485
2024-12-03
Abstract: In supervised machine learning (SML) research, large training datasets are essential for valid results. However, obtaining primary data in learning analytics (LA) is challenging. Data augmentation can address this by expanding and diversifying data, though its use in LA remains underexplored. This paper systematically compares data augmentation techniques and their impact on prediction performance in a typical LA task: prediction of academic outcomes. Augmentation is demonstrated on four SML models, which we successfully replicated from a previous LAK study based on AUC values. Among 21 augmentation techniques, SMOTE-ENN sampling performed the best, improving the average AUC by 0.01 and approximately halving the training time compared to the baseline models. In addition, we compared 99 combinations of chaining 21 techniques, and found minor, although statistically significant, improvements across models when adding noise to SMOTE-ENN (+0.014). Notably, some augmentation techniques significantly lowered predictive performance or increased performance fluctuation related to random chance. This paper's contribution is twofold. Primarily, our empirical findings show that sampling techniques provide the most statistically reliable performance improvements for LA applications of SML, and are computationally more efficient than deep generation methods with complex hyperparameter settings. Second, the LA community may benefit from validating a recent study through independent replication.
Machine Learning, Computers and Society
What problem does this paper attempt to address?
This paper addresses two main problems:

1. **The impact of data augmentation on the performance of prediction models**
   - In Learning Analytics (LA), the performance of Supervised Machine Learning (SML) models depends on high-quality, large-scale training data. In practice, however, collecting enough original student data is difficult because of high time costs, sample homogeneity, and privacy constraints.
   - Data augmentation techniques can address these challenges by expanding and diversifying data, but their use in LA remains underexplored. This study therefore systematically evaluates how different data augmentation techniques affect the performance of prediction models in LA tasks.
2. **The reproducibility and replicability of research**
   - In LA and related fields, few published studies have been successfully replicated by independent teams, which raises doubts about the validity and generalizability of existing results.
   - To improve the credibility of research, this paper independently replicates a previous LAK conference study and then applies data augmentation techniques to the replicated models to test whether they significantly improve model performance.

### Specific research questions

- **RQ1: To what extent can we replicate the analysis and results of previous learning analytics research?**
  - Select and replicate a published LA study to verify the reliability of its methods and results.
- **RQ2: When data augmentation techniques are applied to the original training data, which techniques improve model performance, and to what extent?**
  - Evaluate the impact of 21 data augmentation techniques and their combinations on the prediction performance of four SML models (Logistic Regression, Support Vector Machine, Random Forest, Multi-Layer Perceptron).
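To make the best-performing technique in RQ2 more concrete, the following is a minimal, standard-library-only sketch of the idea behind SMOTE-ENN for a binary task: SMOTE synthesizes minority-class samples by interpolating between a minority point and one of its nearest minority neighbours, and Edited Nearest Neighbours (ENN) then drops samples whose label disagrees with most of their neighbours. It is not the authors' implementation. The function names, the majority-vote rule, the value `k=3`, and the toy data are all assumptions made for illustration. In practice, a maintained implementation such as `SMOTEENN` from the `imbalanced-learn` library would likely be preferable.

```python
import math
import random
from collections import Counter


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority, k=3, rng=None):
    """Oversample `minority` by interpolating toward same-class neighbours
    until both classes have the same number of samples (binary case)."""
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(X) - len(minority_pts)) - len(minority_pts)
    synthetic = []
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(minority_pts))
        j = rng.choice(nearest(i, minority_pts, k))
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority_pts[i], minority_pts[j])]
        )
    return X + synthetic, y + [minority] * len(synthetic)


def enn(X, y, k=3):
    """Keep only samples whose label matches the majority of their k nearest
    neighbours (a simplified, majority-vote variant of ENN)."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: a majority cluster near the origin (label 0),
# a small minority cluster near (5, 5) (label 1), and one mislabeled-looking
# majority point sitting inside the minority cluster.
X = [[0, 0], [0.5, 0], [0, 0.5], [0.5, 0.5], [1, 0], [0, 1], [1, 1], [0.5, 1],
     [5.2, 5.1],
     [5, 5], [5.5, 5], [5, 5.5], [5.5, 5.5]]
y = [0] * 9 + [1] * 4

X_over, y_over = smote(X, y, minority=1)   # balance the classes first
X_res, y_res = enn(X_over, y_over)         # then clean noisy samples

print(Counter(y), "->", Counter(y_over), "->", Counter(y_res))
```

On this toy data, SMOTE raises the minority class from 4 to 9 samples, and the ENN step removes the single majority point embedded in the minority cluster. This illustrates why the combined technique can both balance and shrink or clean a training set, although whether it helps on real LA data is an empirical question of the kind the paper investigates.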
### Research contributions

- **Empirical findings**: Sampling techniques (such as SMOTE-ENN) provide the most statistically reliable performance improvements in LA applications and are more computationally efficient than complex generative methods.
- **Methodological contributions**: The paper provides a method for systematically evaluating data augmentation techniques and shares practical suggestions and lessons learned to help other researchers apply these techniques to their own educational models.
- **Research transparency**: By independently replicating an existing study and verifying its validity, the paper provides a reliable baseline for future LA research.

### Conclusion

By combining data augmentation with research replication, this paper shows that augmentation can yield modest gains in prediction performance and strengthens the credibility and reproducibility of the underlying research, contributing to the development of the LA field.