Abstract: One of the growing trends in machine learning is the use of data generation techniques, since the performance of machine learning models is dependent on the quantity of the training dataset. However, in many medical applications, collecting large datasets is challenging due to resource constraints, which leads to overfitting and poor generalization. This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL), designed to enhance classification performance on small medical datasets through synthetic data generation. The AGCL framework involves feature extraction, K-means clustering, cluster evaluation based on a class separation metric, and the generation of synthetic data points from clusters with distinct class representations. This method was applied to Parkinson's disease screening, utilizing facial expression data, and evaluated across multiple machine learning classifiers. Experimental results demonstrate that AGCL significantly improves classification accuracy compared to baseline, GN and kNNMTD. AGCL achieved the highest overall test accuracy of 83.33% and cross-validation accuracy of 90.90% in majority voting over different emotions, confirming its effectiveness in augmenting small datasets.
What problem does this paper attempt to address?
The paper addresses the following problem: in the medical field, resource constraints on data collection yield small datasets on which machine learning models are hard to train effectively, leading to overfitting and poor generalization. To address this, the authors propose a novel method, **Artificial Data Point Generation in Clustered Latent Space (AGCL)**, which aims to improve classification performance on small medical datasets by generating synthetic data.
### Detailed Explanation
1. **Background and Problem Description**
- In many medical applications, large amounts of data are very difficult to obtain, so training data is often insufficient.
- Small datasets tend to cause machine learning models to overfit and generalize poorly.
- Existing data augmentation methods (such as image rotation, mirroring, etc.) may not be suitable for very small data sets and may even have a negative impact on classification results.
2. **Proposed Method**
- **AGCL Framework**
- **Feature Extraction**: Extract relevant features from the original data.
- **K-means Clustering**: Organize data points into multiple clusters \(C = \{c_1, c_2,\ldots, c_k\}\).
- **Cluster Evaluation**: Evaluate the quality of each cluster using a class separation measure (CSM) and entropy to decide whether it needs further subdivision.
- **Synthetic Data Generation**: Generate new synthetic data points by sampling from a normal distribution whose parameters are derived from each cluster.
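As a rough illustration of the generation step, the sketch below fits a normal distribution to one cluster and samples new points from it. The summary does not say which cluster parameters are used, so the per-feature mean and standard deviation (that is, independent features) are an assumption here, and `sample_from_cluster` is a hypothetical helper name.

```python
import random
import statistics


def sample_from_cluster(cluster, n_samples, seed=None):
    """Draw synthetic points from a per-feature normal fit of one cluster.

    Assumption (not from the source): features are treated as independent,
    so each one gets its own mean and population standard deviation.
    """
    rng = random.Random(seed)
    columns = list(zip(*cluster))  # one tuple of values per feature
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [rng.gauss(mu, sigma) for mu, sigma in zip(means, stds)]
        for _ in range(n_samples)
    ]


# Toy cluster of three 2-D feature vectors (illustrative values only)
cluster = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2)]
synthetic = sample_from_cluster(cluster, n_samples=5, seed=0)
```

A full covariance matrix would capture correlations between features; the diagonal version is used here only to keep the sketch dependency-free.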
3. **Specific Steps**
- **Feature Extraction and Clustering**
- Extract features from the dataset \(X = \{x_1, x_2,\ldots, x_n\}\).
- Cluster the features with the K-means algorithm, which aims to minimize the sum of squared distances from each point in a cluster to that cluster's centroid.
- Each cluster \(c_j\) contains data points \(x_{i,j}\), where \(i = 1, 2,\ldots, n_j\), and \(n_j\) is the number of data points in cluster \(c_j\).
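The clustering step can be sketched with a plain implementation of Lloyd's algorithm, the usual K-means procedure. This is a minimal sketch, not the authors' code: the farthest-point initialization is an assumption chosen to make the example deterministic, and the toy points stand in for extracted features.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on a list of coordinate tuples; returns (labels, centroids)."""
    # Farthest-point initialization (an assumption, chosen for determinism).
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(tuple(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Two well-separated toy groups standing in for extracted feature vectors
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(X, k=2)
```

In practice a library implementation with multiple random restarts would usually be preferred; the hand-written loop only makes the assignment and update steps explicit.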
- **Cluster Evaluation and Re-clustering**
- Calculate the class separation measure (CSM):
\[
CSM(c_j)=\frac{S_{\text{inter}}(c_j)}{C_{\text{intra}}(c_j)}
\]
where,
\[
C_{\text{intra}}(c_j)=\sum_{l\in L}\frac{n_l(n_l - 1)}{2}\sum_{i,k\in l,i\neq k}\|x_i - x_k\|
\]
\[
S_{\text{inter}}(c_j)=\sum_{l_i,l_k\in L,l_i\neq l_k}\frac{1}{n_{l_i}\cdot n_{l_k}}\sum_{x_i\in l_i,x_k\in l_k}\|x_i - x_k\|
\]
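To make the two sums concrete, the sketch below evaluates the CSM of a single cluster by following the formulas as printed, including the \(n_l(n_l - 1)/2\) weight on the intra-class term and sums over ordered pairs. The function name and the handling of a zero denominator are assumptions; the summary does not specify either.

```python
import math
from itertools import permutations


def class_separation_measure(points, labels):
    """CSM of one cluster: S_inter / C_intra, following the formulas as printed."""
    by_class = {}
    for p, lab in zip(points, labels):
        by_class.setdefault(lab, []).append(p)

    # C_intra: per class, n_l(n_l - 1)/2 times the distance sum over ordered pairs i != k.
    c_intra = 0.0
    for members in by_class.values():
        n = len(members)
        pair_sum = sum(math.dist(a, b) for a, b in permutations(members, 2))
        c_intra += n * (n - 1) / 2 * pair_sum

    # S_inter: per ordered pair of distinct classes, the mean cross-class distance.
    s_inter = 0.0
    for la, lb in permutations(by_class, 2):
        group_a, group_b = by_class[la], by_class[lb]
        cross_sum = sum(math.dist(a, b) for a in group_a for b in group_b)
        s_inter += cross_sum / (len(group_a) * len(group_b))

    # Assumption: with no intra-class distance to divide by, return infinity.
    return s_inter / c_intra if c_intra > 0 else math.inf


# Toy cluster: two points of class "a" and one of class "b"
csm = class_separation_measure([(0, 0), (3, 4), (6, 8)], ["a", "a", "b"])
```

If a per-pair average is intended for the intra-class term, the weight would be replaced by a normalizing factor; the code keeps the printed form so it can be compared against the equations line by line.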
- Calculate entropy \(H(c_j)\):
\[
H(c_j)=-\sum_{l\in L}p(l)\log_2p(l)
\]
where \(p(l)\) is the proportion of points in cluster \(c_j\) that belong to class \(l\), and \(L\) is the set of class labels present in the cluster.
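The entropy of a cluster depends only on its class labels, so it can be computed from label counts. The following is a minimal sketch with a hypothetical function name; the toy labels are illustrative.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (base 2) of the class labels inside one cluster."""
    counts = Counter(labels)
    n = len(labels)
    # p(l) is the fraction of the cluster's points that carry label l.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


balanced = cluster_entropy(["pd", "pd", "healthy", "healthy"])  # evenly mixed cluster
pure = cluster_entropy(["pd", "pd", "pd"])  # single-class cluster
```

An evenly mixed two-class cluster gives the maximum of one bit, while a single-class cluster gives zero.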
- Comprehensive Separation Criterion:
\[
\text{Separation Criterion}(c_j)=\frac{CSM(c_j)}{H(c_j)}
\]
- If the separation criterion falls below the threshold, re-clustering is triggered.
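The decision rule can be sketched as follows. The threshold value and the cluster statistics are hypothetical placeholders, and treating a pure cluster (entropy zero) as perfectly separated is an assumption made to avoid division by zero; the summary does not say how that case is handled.

```python
import math


def separation_criterion(csm, entropy):
    """Ratio CSM / H for one cluster."""
    # Assumption: a pure cluster (entropy 0) counts as perfectly separated.
    if entropy == 0:
        return math.inf
    return csm / entropy


def needs_reclustering(csm, entropy, threshold):
    """True when the criterion falls below the threshold."""
    return separation_criterion(csm, entropy) < threshold


# Hypothetical (CSM, entropy) values per cluster; threshold 1.0 is a placeholder
clusters = {"c1": (1.5, 0.5), "c2": (0.4, 1.0), "c3": (0.0, 0.0)}
to_split = [
    name
    for name, (csm, h) in clusters.items()
    if needs_reclustering(csm, h, threshold=1.0)
]
```

Here only `c2` is flagged: its classes are both mixed (high entropy) and poorly separated (low CSM).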
- **Synthetic Data Generation**: