Quantity versus Diversity: Influence of Data on Detecting EEG Pathology with Advanced ML Models

Martyna Poziomska, Marian Dovgialo, Przemysław Olbratowski, Paweł Niedbalski, Paweł Ogniewski, Joanna Zych, Jacek Rogala, Jarosław Żygierewicz

2024-11-14

Abstract: This study investigates the impact of quantity and diversity of data on the performance of various machine-learning models for detecting general EEG pathology. We utilized an EEG dataset of 2,993 recordings from Temple University Hospital and a dataset of 55,787 recordings from Elmiko Biosignals sp. z o.o. The latter contains data from 39 hospitals and a diverse patient set with varied conditions. Thus, we introduce the Elmiko dataset, the largest publicly available EEG corpus. Our findings show that small and consistent datasets enable a wide range of models to achieve high accuracy; however, variations in pathological conditions, recording protocols, and labeling standards lead to significant performance degradation. Nonetheless, increasing the number of available recordings improves predictive accuracy and may even compensate for data diversity, particularly in neural networks based on attention mechanisms or the transformer architecture. A meta-model that combined these networks with a gradient-boosting approach using handcrafted features demonstrated superior performance across varied datasets.

Signal Processing, Machine Learning

What problem does this paper attempt to address?

This paper explores how the quantity and diversity of data affect the performance of machine-learning models that detect pathology in electroencephalogram (EEG) recordings. Specifically, the research aims to answer the following key questions:

1. **Relationship between data quantity and model performance**: Does increasing the amount of data significantly improve a model's prediction accuracy? How do different types of machine-learning models perform when given large amounts of data?
2. **Relationship between data diversity and model generalization ability**: Can data sets from multiple hospitals, different patient conditions, and diverse recording protocols help a model generalize better to new, unseen clinical scenarios?
3. **Performance differences between complex and simple models**: As data quantity and diversity change, do complex neural networks (such as networks based on attention mechanisms or Transformer architectures) perform significantly differently from simple classical models (such as random forests and support vector machines)?
4. **Challenges and opportunities in multi-source data training**: Multi-source data (multi-cohort learning) increases data diversity but also introduces heterogeneity. The research attempts to reveal how these data characteristics affect model performance and to explore effective solutions.

To answer these questions, the authors ran experiments on two data sets:

- **TUH data set**: A relatively small and homogeneous data set, containing 2,993 records, all from a single institution.
- **ELM 19 data set**: A large and heterogeneous data set, containing 55,787 records from 39 different hospitals, covering a variety of patient conditions and recording protocols.

By comparing model performance on these two data sets, the study aims to identify the most effective strategies for developing robust EEG classifiers that work reliably across different clinical environments. It also examines the role of data diversity in the medical field, offering insights for the future design of ML architectures and the collection of medical data sets.
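The abstract mentions a meta-model that combines neural networks with a gradient-boosting model, but this summary does not say how the combination is done. As a rough illustration of the general idea only, the sketch below stacks the outputs of two base models with a small logistic-regression meta-learner. The choice of logistic regression, the function names `fit_meta_model` and `predict_meta`, and the synthetic "base-model probabilities" are all assumptions made for this example; none of them come from the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_model(base_probs, labels, lr=1.0, epochs=2000):
    """Fit a logistic-regression meta-learner by batch gradient descent.

    base_probs: one row per recording, one probability per base model.
    labels: 0/1 targets (1 = pathological).
    """
    n_models = len(base_probs[0])
    n = len(labels)
    weights = [0.0] * n_models
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_models
        grad_b = 0.0
        for row, y in zip(base_probs, labels):
            err = sigmoid(bias + sum(w * p for w, p in zip(weights, row))) - y
            for j, p in enumerate(row):
                grad_w[j] += err * p
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_meta(weights, bias, row):
    """Combined probability that a recording is pathological."""
    return sigmoid(bias + sum(w * p for w, p in zip(weights, row)))


def noisy_prob(y, noise=0.35):
    # Hypothetical base-model output: informative but noisy, clipped to [0, 1].
    return min(1.0, max(0.0, 0.2 + 0.6 * y + random.uniform(-noise, noise)))


# Synthetic stand-ins for the outputs of two base models
# (e.g. a neural network and a gradient-boosting model).
random.seed(0)
labels = [i % 2 for i in range(200)]
base_probs = [[noisy_prob(y), noisy_prob(y)] for y in labels]

weights, bias = fit_meta_model(base_probs, labels)

# Accuracy on the same synthetic data used for fitting (illustration only).
correct = sum(
    (predict_meta(weights, bias, row) > 0.5) == bool(y)
    for row, y in zip(base_probs, labels)
)
accuracy = correct / len(labels)
print(f"meta-model accuracy on synthetic data: {accuracy:.3f}")
```

In a real setting the meta-learner would be fitted on held-out predictions from the base models rather than on the data they were trained on; otherwise it tends to favor whichever base model overfits most.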

Assisting Schizophrenia Diagnosis Using Clinical Electroencephalography and Interpretable Graph Neural Networks: a Real-World and Cross-Site Study

Data-driven retrieval of population-level EEG features and their role in neurodegenerative diseases

Psychiatric disorders from EEG signals through deep learning models

Differentiating Ischemic Stroke Patients from Healthy Subjects Using a Large-Scale, Retrospective EEG Database and Machine Learning Methods

A Lightweight Multi-Mental Disorders Detection Method Using Entropy-Based Matrix from Single-Channel EEG Signals

EEG-based Signatures of Schizophrenia, Depression, and Aberrant Aging: A Supervised Machine Learning Investigation

Machine learning of brain-specific biomarkers from EEG

Data leakage in deep learning studies of translational EEG

Stabilizing Subject Transfer in EEG Classification with Divergence Estimation

Precise Discrimination for Multiple Etiologies of Dementia Cases Based on Deep Learning with Electroencephalography

Comparative Analysis of Epileptic Seizure Prediction: Exploring Diverse Pre-Processing Techniques and Machine Learning Models

Epileptic Seizure Detection Using Machine Learning Techniques

Characterizing the heterogeneity of neurodegenerative diseases through EEG normative modeling

Machine Learning-Based Detection of Parkinson's Disease From Resting-State EEG: A Multi-Center Study

Automatic diagnostics of electroencephalography pathology based on multi-domain feature fusion

Leveraging Multiple Distinct EEG Training Sessions for Improvement of Spectral-Based Biometric Verification Results

Analysis of the impact of deep learning know-how and data in modelling neonatal EEG

Machine Learning Approaches for Detecting Parkinson’s Disease from EEG Analysis: A Systematic Review

Schizophrenia diagnosis based on diverse epoch size resting-state EEG using machine learning

The Dependence of Machine Learning on Electronic Medical Record Quality