Abstract: One of the growing trends in machine learning is the use of data generation techniques, since the performance of machine learning models is dependent on the quantity of the training dataset. However, in many medical applications, collecting large datasets is challenging due to resource constraints, which leads to overfitting and poor generalization. This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL), designed to enhance classification performance on small medical datasets through synthetic data generation. The AGCL framework involves feature extraction, K-means clustering, cluster evaluation based on a class separation metric, and the generation of synthetic data points from clusters with distinct class representations. This method was applied to Parkinson's disease screening, utilizing facial expression data, and evaluated across multiple machine learning classifiers. Experimental results demonstrate that AGCL significantly improves classification accuracy compared to baseline, GN and kNNMTD. AGCL achieved the highest overall test accuracy of 83.33% and cross-validation accuracy of 90.90% in majority voting over different emotions, confirming its effectiveness in augmenting small datasets.

What problem does this paper attempt to address?

The paper addresses the difficulty of training machine-learning models on small medical datasets: because data collection in the medical field is resource-constrained, models trained on the little data available tend to overfit and generalize poorly. To mitigate this, the paper proposes a novel method, **Artificial Data Point Generation in Clustered Latent Space (AGCL)**, which aims to improve classification performance on small medical datasets by generating synthetic data.

### Detailed Explanation

1. **Background and Problem Description**
   - In many medical applications, large amounts of data are very difficult to obtain, so training data is insufficient.
   - Small datasets make machine-learning models prone to overfitting and poor generalization.
   - Existing data augmentation methods (such as image rotation and mirroring) may not be suitable for very small datasets and may even harm classification results.

2. **Proposed Method**
   - **AGCL Framework**
     - **Feature Extraction**: Extract relevant features from the original data.
     - **K-means Clustering**: Organize the data points into multiple clusters \(C = \{c_1, c_2, \ldots, c_k\}\).
     - **Cluster Evaluation**: Evaluate the quality of each cluster using the class separation measure (CSM) and entropy to decide whether further subdivision is required.
     - **Synthetic Data Generation**: Generate new synthetic data points from a normal distribution parameterized by each cluster.

3. **Specific Steps**
   - **Feature Extraction and Clustering**
     - Extract features from the dataset \(X = \{x_1, x_2, \ldots, x_n\}\).
     - Cluster the features with the K-means algorithm, which minimizes the sum of squared distances from each point to the centroid of its cluster.
     - Each cluster \(c_j\) contains data points \(x_{i,j}\), where \(i = 1, 2, \ldots, n_j\) and \(n_j\) is the number of data points in cluster \(c_j\).
   - **Cluster Evaluation and Re-clustering**
     - Calculate the class separation measure (CSM):
       \[ CSM(c_j) = \frac{S_{\text{inter}}(c_j)}{C_{\text{intra}}(c_j)} \]
       where
       \[ C_{\text{intra}}(c_j) = \sum_{l \in L_1} \frac{n_l(n_l - 1)}{2} \sum_{i,k \in l,\, i \neq k} \|x_i - x_k\| \]
       \[ S_{\text{inter}}(c_j) = \sum_{l_i, l_k \in L,\, l_i \neq l_k} \frac{1}{n_{l_i} \cdot n_{l_k}} \sum_{x_i \in l_i,\, x_k \in l_k} \|x_i - x_k\| \]
     - Calculate the entropy \(H(c_j)\):
       \[ H(c_j) = -\sum_{l \in L} p(l) \log_2 p(l) \]
       where \(p(l)\) is the probability that a point in cluster \(c_j\) belongs to class \(l\).
     - Combine both into a separation criterion:
       \[ \text{Separation Criterion}(c_j) = \frac{CSM(c_j)}{H(c_j)} \]
     - If the separation criterion is lower than the threshold, the re-clustering process is triggered.
   - **Synthetic Data Generation**:
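The cluster evaluation and generation steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. It assumes the points of one cluster and their class labels are already available (for example from K-means), it normalizes the intra-class term as a mean over point pairs (the factor as transcribed above is \(n_l(n_l - 1)/2\), so this choice is an assumption), it treats a single-class cluster as perfectly separated, and it samples each feature independently from a normal distribution fitted to the cluster. The function names, the `eps` guard, and the `threshold` parameter are hypothetical.

```python
import math
import random
import statistics
from collections import Counter, defaultdict
from itertools import combinations, permutations, product


def intra_class_distance(by_class):
    """Sum over classes of the mean pairwise distance within each class.
    Assumption: averaged over unordered pairs of points."""
    total = 0.0
    for pts in by_class.values():
        pairs = list(combinations(pts, 2))
        if pairs:
            total += sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return total


def inter_class_distance(by_class):
    """Sum over ordered class pairs (l_i != l_k) of the mean cross-class distance."""
    total = 0.0
    for la, lb in permutations(by_class, 2):
        a_pts, b_pts = by_class[la], by_class[lb]
        dists = sum(math.dist(a, b) for a, b in product(a_pts, b_pts))
        total += dists / (len(a_pts) * len(b_pts))
    return total


def entropy(labels):
    """Shannon entropy (base 2) of the class labels inside one cluster."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def separation_criterion(points, labels, eps=1e-9):
    """CSM divided by entropy for one cluster; eps guards against division by zero."""
    by_class = defaultdict(list)
    for point, label in zip(points, labels):
        by_class[label].append(point)
    if len(by_class) == 1:
        # Assumption: a cluster holding a single class counts as fully separated.
        return math.inf
    csm = inter_class_distance(by_class) / (intra_class_distance(by_class) + eps)
    return csm / (entropy(labels) + eps)


def needs_reclustering(points, labels, threshold):
    """True when the criterion falls below a user-chosen (hypothetical) threshold."""
    return separation_criterion(points, labels) < threshold


def generate_points(points, n_new, rng):
    """Draw n_new synthetic points, each feature from N(mean, std) of the cluster.
    Assumption: features are sampled independently of each other."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [tuple(rng.gauss(m, s) for m, s in zip(means, stds)) for _ in range(n_new)]


# Toy cluster: two classes that are far apart along the first feature.
cluster_points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
cluster_labels = [0, 0, 1, 1]
score = separation_criterion(cluster_points, cluster_labels)
synthetic = generate_points(cluster_points, 5, random.Random(0))
print(score, len(synthetic))
```

In this sketch a higher score means the classes inside the cluster are further apart relative to their own spread and the label mix is less uncertain. Sampling per feature keeps the example short; a full covariance estimate is a possible alternative but can be unstable with very few points per cluster.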

Artificial Data Point Generation in Clustered Latent Space for Small Medical Datasets
