Comprehensive Methodology for Sample Augmentation in EEG Biomarker Studies for Alzheimers Risk Classification

Veronica Henao Isaza, David Aguillon, Carlos Andres Tobon Quintero, Francisco Lopera, John Fredy Ochoa Gomez
2024-11-20
Abstract: Background: Dementia, marked by cognitive decline, is a global health challenge. Alzheimer's disease (AD), the leading type, accounts for ~70% of cases. Electroencephalography (EEG) measures show promise in identifying AD risk, but obtaining large samples for reliable comparisons is challenging. Objective: This study integrates signal processing, harmonization, and statistical techniques to enhance sample size and improve AD risk classification reliability. Methods: We used advanced EEG preprocessing, feature extraction, harmonization, and propensity score matching (PSM) to balance healthy non-carriers (HC) and asymptomatic E280A mutation carriers (ACr). Data from four databases were harmonized to adjust site effects while preserving covariates like age and sex. PSM ratios (2:1, 5:1, 10:1) were applied to assess sample size impact on model performance. The final dataset underwent machine learning analysis with decision trees and cross-validation for robust results. Results: Balancing sample sizes via PSM significantly improved classification accuracy, ranging from 0.92 to 0.96 across ratios. This approach enabled precise risk identification even with limited samples. Conclusion: Integrating data processing, harmonization, and balancing techniques improves AD risk classification accuracy, offering potential for other neurodegenerative diseases.
Signal Processing, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of insufficient sample size in Alzheimer's disease (AD) risk classification. It proposes a comprehensive method that enlarges the usable sample by integrating signal processing, data harmonization, and statistical techniques, thereby improving the reliability of the AD risk classification model. Particular attention is paid to balancing the proportion of healthy non-carriers (HC) to asymptomatic E280A mutation carriers (ACr) in order to optimize model performance.

### Main research objectives:

1. **Increase the sample size**: Integrate data from multiple databases and use propensity score matching (PSM) to balance the sample proportions of the groups.
2. **Improve model performance**: Improve the accuracy of AD risk classification with machine-learning methods, in particular a decision-tree model.
3. **Data harmonization**: When integrating data from multiple databases, control the site effect while retaining important covariate effects such as age and sex.

### Research methods:

- **Data sources**: Data come from four databases (UdeA 1, UdeA 2, SRM, CHBMP), each with its own subject groups and data collection protocol.
- **Data pre-processing**: A standardized EEG pre-processing pipeline (PREP) is applied, including signal detrending, robust referencing, bad-channel interpolation, and independent component analysis (ICA).
- **Feature extraction**: Multiple features are extracted, including relative power, entropy, coherence, cross-frequency relationships, and synchronization likelihood.
- **Data harmonization**: The ComBat algorithm in the neuroHarmonize package is used to harmonize the data and control the site effect.
- **Propensity score matching (PSM)**: Propensity scores are calculated by logistic regression and used to match covariates between the HC and ACr groups, reducing bias.
- **Model selection and evaluation**: A decision-tree model is used for classification, and its performance is evaluated by cross-validation.

### Research results:

- **Classification accuracy**: Across the tested sample proportions (2:1, 5:1, 10:1), classification accuracy improves significantly, ranging from 0.92 to 0.96.
- **Feature importance**: Feature importance is evaluated with Cohen's d; some features differ significantly between the HC and ACr groups.
- **Model stability**: The model shows its highest stability and accuracy at the 10:1 ratio.

### Conclusion:

Through this comprehensive method, the study improves the reliability and accuracy of the AD risk classification model, especially when the sample size is limited. The method is applicable to Alzheimer's disease research and can also be extended to risk classification for other neurodegenerative diseases.
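
The propensity score matching step listed under the research methods can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The function names (`fit_propensity`, `match_controls`), the greedy nearest-neighbour strategy without replacement, the reading of a k:1 ratio as k HC subjects per ACr subject, and the synthetic ages and sexes are all assumptions made for illustration.

```python
import math
import random
import statistics


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_propensity(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent (hypothetical helper).

    X: covariate rows, e.g. [standardized age, sex]; y: 1 = ACr, 0 = HC.
    Returns one propensity score (estimated P(ACr | covariates)) per subject.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) for row in X]


def match_controls(scores, y, k):
    """Greedy nearest-neighbour matching without replacement: k HC per ACr.

    Returns a dict mapping each ACr index to the list of matched HC indices.
    """
    pool = {i for i, label in enumerate(y) if label == 0}
    matches = {}
    for i in (i for i, label in enumerate(y) if label == 1):
        # Closest remaining controls by propensity score; ties broken by index.
        nearest = sorted(pool, key=lambda j: (abs(scores[j] - scores[i]), j))[:k]
        matches[i] = nearest
        pool.difference_update(nearest)
    return matches


# Synthetic demo data (NOT from the study): 5 carriers, 40 non-carriers.
rng = random.Random(0)
ages = [rng.gauss(32, 6) for _ in range(5)] + [rng.gauss(45, 15) for _ in range(40)]
sexes = [rng.randint(0, 1) for _ in range(45)]
y = [1] * 5 + [0] * 40
mu, sd = statistics.mean(ages), statistics.stdev(ages)
X = [[(a - mu) / sd, s] for a, s in zip(ages, sexes)]

scores = fit_propensity(X, y)
for k in (2, 5):
    matched = match_controls(scores, y, k)
    n_hc = sum(len(v) for v in matched.values())
    print(f"{k}:1 -> {len(matched)} ACr matched to {n_hc} HC")
```

In practice a library logistic regression and a caliper on the score distance would likely replace the hand-written fit and the unconstrained greedy search; the sketch only shows how a matching ratio translates into a selected subset of controls.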