Abstract:

Background: Dementia, marked by cognitive decline, is a global health challenge. Alzheimer's disease (AD), the leading type, accounts for ~70% of cases. Electroencephalography (EEG) measures show promise in identifying AD risk, but obtaining large samples for reliable comparisons is challenging.

Objective: This study integrates signal processing, harmonization, and statistical techniques to enhance sample size and improve AD risk classification reliability.

Methods: We used advanced EEG preprocessing, feature extraction, harmonization, and propensity score matching (PSM) to balance healthy non-carriers (HC) and asymptomatic E280A mutation carriers (ACr). Data from four databases were harmonized to adjust site effects while preserving covariates like age and sex. PSM ratios (2:1, 5:1, 10:1) were applied to assess sample size impact on model performance. The final dataset underwent machine learning analysis with decision trees and cross-validation for robust results.

Results: Balancing sample sizes via PSM significantly improved classification accuracy, ranging from 0.92 to 0.96 across ratios. This approach enabled precise risk identification even with limited samples.

Conclusion: Integrating data processing, harmonization, and balancing techniques improves AD risk classification accuracy, offering potential for other neurodegenerative diseases.
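
The propensity score matching step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the covariates are synthetic, the logistic regression is a hand-written gradient-descent fit (chosen only to keep the example free of third-party dependencies), the matching is greedy nearest-neighbor without replacement, and a ratio of k:1 is assumed to mean k non-carriers per carrier. The helper names `standardize`, `fit_logistic`, and `match_controls` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def standardize(rows):
    """Scale each covariate column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + dot(w, xi)) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def match_controls(scores, is_carrier, ratio):
    """Greedy nearest-neighbor matching on the propensity score, without replacement.

    Returns {carrier_index: [indices of its `ratio` matched non-carriers]}.
    """
    carriers = [i for i, c in enumerate(is_carrier) if c]
    pool = {i for i, c in enumerate(is_carrier) if not c}
    if ratio * len(carriers) > len(pool):
        raise ValueError("not enough non-carriers for the requested ratio")
    matches = {}
    for i in carriers:
        # Sort remaining non-carriers by score distance (index breaks ties).
        nearest = sorted(pool, key=lambda j: (abs(scores[j] - scores[i]), j))[:ratio]
        matches[i] = nearest
        pool.difference_update(nearest)
    return matches


random.seed(0)
# Synthetic covariates (age in years, sex coded 0/1); NOT data from the study.
carriers = [[random.gauss(32, 6), random.randint(0, 1)] for _ in range(10)]
controls = [[random.gauss(40, 12), random.randint(0, 1)] for _ in range(120)]
covariates = standardize(carriers + controls)
is_carrier = [1] * len(carriers) + [0] * len(controls)

# Propensity score: estimated probability of being a carrier given the covariates.
weights, bias = fit_logistic(covariates, is_carrier)
scores = [sigmoid(bias + dot(weights, x)) for x in covariates]

for ratio in (2, 5, 10):
    matched = match_controls(scores, is_carrier, ratio)
    n_controls = sum(len(v) for v in matched.values())
    print(f"{ratio}:1 -> {len(matched)} carriers, {n_controls} matched non-carriers")
```

Greedy matching depends on the order in which carriers are processed; optimal matching or a caliper on the score distance are common alternatives when match quality matters more than simplicity.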

What problem does this paper attempt to address?

The paper addresses the problem of insufficient sample size in Alzheimer's disease (AD) risk classification. It proposes a combined approach that integrates signal processing, data harmonization, and statistical techniques to increase the usable sample size and thereby improve the reliability of AD risk classification models. It pays particular attention to balancing the sample proportions of healthy non-carriers (HC) and asymptomatic carriers of the E280A mutation (ACr) to optimize model performance.

### Main research objectives:

1. **Increase the sample size**: Integrate data from multiple databases and use propensity score matching (PSM) to balance the sample proportions of the groups.
2. **Improve model performance**: Improve the accuracy of AD risk classification with machine-learning methods, in particular a decision-tree model.
3. **Data harmonization**: When integrating data from multiple databases, control the site effect while retaining important covariate effects such as age and sex.

### Research methods:

- **Data sources**: Obtain data from four databases (UdeA 1, UdeA 2, SRM, CHBMP), each with its own subject groups and data collection protocol.
- **Data pre-processing**: Use the standardized EEG pre-processing pipeline (PREP), including steps such as signal detrending, robust referencing, bad-channel interpolation, and independent component analysis (ICA).
- **Feature extraction**: Extract multiple features, including relative power, entropy, coherence, cross-frequency relationships, and synchronization likelihood.
- **Data harmonization**: Use the ComBat algorithm in the neuroHarmonize package to harmonize the data and control the site effect.
- **Propensity score matching (PSM)**: Estimate the propensity score with logistic regression and match the HC and ACr groups on their covariates to reduce bias.
- **Model selection and evaluation**: Use a decision-tree model for classification and evaluate its performance with cross-validation.

### Research results:

- **Classification accuracy**: Across the sample ratios tested (2:1, 5:1, 10:1), classification accuracy improves significantly, ranging from 0.92 to 0.96.
- **Feature importance**: Feature importance is assessed with Cohen's d; some features differ significantly between the HC and ACr groups.
- **Model stability**: At the 10:1 ratio, the model shows the highest stability and accuracy.

### Conclusion:

The study improves the reliability and accuracy of AD risk classification through a combined approach, particularly when sample size is limited. The approach is not specific to Alzheimer's disease and can be extended to risk classification for other neurodegenerative diseases.
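
The feature-importance step in the summary above relies on Cohen's d, the difference between two group means divided by their pooled standard deviation. The sketch below shows that computation using only the Python standard library. The feature values are made up for illustration and are not results from the study, and `cohens_d` is a hypothetical helper name.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Illustrative values only: one EEG feature (e.g. a relative power) per subject.
hc_values = [0.30, 0.34, 0.32, 0.36, 0.33, 0.31]
acr_values = [0.26, 0.29, 0.27, 0.31, 0.28, 0.30]

d = cohens_d(hc_values, acr_values)
print(f"Cohen's d (HC vs ACr): {d:.2f}")
```

Computing d for every extracted feature and sorting by its absolute value gives a simple ranking of which features separate the two groups most strongly. By the usual convention, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects.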

Comprehensive Methodology for Sample Augmentation in EEG Biomarker Studies for Alzheimers Risk Classification

Analysis of Risk Factors in Dementia Through Machine Learning

Combining EEG signal processing with supervised methods for Alzheimer’s patients classification

Integrative EEG biomarkers predict progression to Alzheimer's disease at the MCI stage

A Self-driven Approach For Multi-class Discrimination In Alzheimer's Disease Based On Wearable EEG.

Neural Biomarker Diagnosis and Prediction to Mild Cognitive Impairment and Alzheimer’s Disease Using EEG Technology

Using EEG, SPECT, and Multivariate Resampling Methods to Differentiate Between Alzheimer's and other Cognitive Impairments

Fully Automated Discrimination of Alzheimer's Disease Using Resting-State Electroencephalography Signals.

Using Multi-Scale Genetic, Neuroimaging and Clinical Data for Predicting Alzheimer’s Disease and Reconstruction of Relevant Biological Mechanisms

Classifying Alzheimers Disease and Dementia Patients Using Non-invasive EEG Biomarkers

Utilizing portable electroencephalography to screen for pathology of Alzheimer's disease: a methodological advancement in diagnosis of neurodegenerative diseases

Assessing the Potential of Data Augmentation in EEG Functional Connectivity for Early Detection of Alzheimer’s Disease

Balancing Spectral, Temporal and Spatial Information for EEG-based Alzheimer's Disease Classification

P3-287: Composite Cognitive Endpoints with Improved Power to Detect Presymptomatic Alzheimer's Disease Treatment Effects: Findings in the Colombian Kindred with the E280A Presenilin 1 Mutation and the Alzheimer's Prevention Initiative

Comparative analysis of machine learning algorithms for Alzheimer's disease classification using EEG signals and genetic information

Evaluating the reliability of neurocognitive biomarkers of neurodegenerative diseases across countries: A machine learning approach

Synthetic data analysis for early detection of Alzheimer progression through machine learning algorithms

Towards improving Alzheimer's intervention: a machine learning approach for biomarker detection through combining MEG and MRI pipelines

Early dementia diagnosis, MCI‐to‐dementia risk prediction, and the role of machine learning methods for feature extraction from integrated biomarkers, in particular for EEG signal analysis

TMS-EEG perturbation biomarkers for Alzheimer’s disease patients classification