Abstract: We provide a definition for class density that can be used to measure the aggregate similarity of the samples within each of the classes in a high-dimensional, unstructured dataset. We then put forth several candidate methods for calculating class density and analyze the correlation between the values each method produces and the corresponding individual class test accuracies achieved on a trained model. Additionally, we propose a definition for dataset quality for high-dimensional, unstructured data and show that those datasets that met a certain quality threshold (experimentally demonstrated to be > 10 for the datasets studied) were candidates for eliding redundant data based on the individual class densities.
What problem does this paper attempt to address?
This paper addresses the following problem: for high-dimensional, unstructured data sets, how to define and measure class density and data set quality, and how these metrics relate to classification accuracy. Specifically, the author aims to:
1. **Define class density**: Provide a way to measure the aggregate similarity of the samples within each class of a high-dimensional, unstructured data set. To this end, the author proposes calculating class density in the three-dimensional space obtained after UMAP dimensionality reduction.
2. **Evaluate data set quality**: Propose a way to measure the quality of high-dimensional, unstructured data sets, with particular attention to the integrity of the data set. The author found through experiments that when the quality of a data set exceeds a certain threshold (greater than 10 for the data sets studied), redundant data can be deleted based on class density without significantly affecting classification accuracy.
3. **Study the relationship between class density and classification accuracy**: Analyze the correlation between the density values obtained by different class density calculation methods and the test accuracy of the corresponding classes to determine whether there is a trend of higher density leading to higher accuracy.
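To make the third goal concrete, the correlation between per-class density and per-class test accuracy could be computed roughly as follows. This is a minimal sketch that assumes a Pearson correlation, which the summary does not specify; the helper name `pearson` and the density and accuracy values are made up for illustration and are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Hypothetical per-class values, one entry per class (illustration only).
densities = [0.5, 1.0, 1.5, 2.0]
accuracies = [0.80, 0.85, 0.90, 0.95]

r = pearson(densities, accuracies)            # close to +1: perfectly linear toy data
r_reversed = pearson(densities, accuracies[::-1])  # close to -1
```

A positive coefficient would indicate that denser classes tend to be classified more accurately; a negative one would indicate the opposite trend.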
### Specific problem description
- **Definition and calculation of data density**: Because distances in high-dimensional data tend toward a nearly uniform distribution (the "curse of dimensionality"), the author uses UMAP to reduce high-dimensional image data to three dimensions and defines class density in that reduced space. Class density is defined as the aggregate similarity of the samples in each class and is calculated by three candidate methods, which use the minimum, maximum, and mean standard deviation respectively:
\[
d_{\text{min}}^i = n \cdot c_i \left( \sum_{j = 1}^n c_j \right)^{-1} \cdot \min(\sigma_i)^{-1}
\]
\[
d_{\text{max}}^i = n \cdot c_i \left( \sum_{j = 1}^n c_j \right)^{-1} \cdot \max(\sigma_i)^{-1}
\]
\[
d_{\text{mean}}^i = n \cdot c_i \left( \sum_{j = 1}^n c_j \right)^{-1} \cdot \left( \frac{1}{m} \sum_{k = 1}^m \sigma_{ik} \right)^{-1}
\]
- **Definition of data set quality**: Data set quality is defined as the integrity of the data set, that is, whether the classification accuracy remains unchanged or almost unchanged after a certain amount of data is deleted. The author found through experiments that for 6 data sets, when the quality is greater than 10, the class density can be reduced to 1.0 while maintaining an accuracy that is not statistically significantly different.
- **Dynamic data reduction experiment**: The author gradually reduces the number of samples in the training data set through a dynamic data reduction strategy to reach the target density value and evaluates the classification accuracy after the reduction. The experimental results show that on multiple data sets, the amount of training data can be significantly reduced without affecting the classification performance.
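The density formulas and the reduction idea described above can be sketched in a few lines of Python. This is a hedged illustration, not the paper's implementation. It assumes that \(n\) is the number of classes, \(c_i\) the number of samples in class \(i\), \(\sigma_{ik}\) the standard deviation of class \(i\) along embedding axis \(k\), and \(m\) the number of axes (three after UMAP); the summary does not define these symbols. The function names, the use of the population standard deviation, the random-removal strategy, and the synthetic points standing in for UMAP output are all hypothetical.

```python
import random
from statistics import pstdev


def mean(values):
    return sum(values) / len(values)


def class_density(points, n_classes, total, reducer):
    """Density of one class: n * c_i / sum(c_j) / reducer(sigma_i)."""
    # Assumption: sigma_i holds one population standard deviation per axis.
    # A reducer value of zero (identical points) would raise ZeroDivisionError.
    sigmas = [pstdev(axis) for axis in zip(*points)]
    return n_classes * len(points) / total / reducer(sigmas)


def all_densities(data, reducer=mean):
    """Map each class label to its density under the chosen reducer."""
    total = sum(len(pts) for pts in data.values())
    return {label: class_density(pts, len(data), total, reducer)
            for label, pts in data.items()}


def reduce_to_target(data, target=1.0, min_per_class=2, seed=0):
    """Drop random samples from over-dense classes until none exceeds target.

    Hypothetical strategy: the summary does not say which samples are removed.
    """
    rng = random.Random(seed)
    data = {label: list(pts) for label, pts in data.items()}
    while True:
        dens = all_densities(data)
        over = [k for k in data
                if dens[k] > target and len(data[k]) > min_per_class]
        if not over:
            return data, dens
        # Remove one random sample from the densest offending class, then recompute.
        densest = max(over, key=dens.get)
        data[densest].pop(rng.randrange(len(data[densest])))


# Toy check of the three variants: per-axis sigmas are (0.5, 1, 2) and (1, 2, 4).
toy = {"tight": [(0, 0, 0), (1, 2, 4)], "wide": [(0, 0, 0), (2, 4, 8)]}
d_min = all_densities(toy, reducer=min)
d_max = all_densities(toy, reducer=max)
d_mean = all_densities(toy, reducer=mean)

# Synthetic 3-D "embeddings": one compact class and one spread-out class.
rng = random.Random(42)
synthetic = {
    "tight": [tuple(rng.gauss(0, 0.2) for _ in range(3)) for _ in range(200)],
    "wide": [tuple(rng.gauss(5, 3.0) for _ in range(3)) for _ in range(200)],
}
reduced, final = reduce_to_target(synthetic, target=1.0)
```

Under these assumptions the compact class is thinned while the spread-out class keeps all of its samples. Because removing samples from one class raises the relative share of the others, the densities are recomputed after every removal rather than once up front.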
### Conclusion
Through experiments on different data sets, the author found:
- The mean standard deviation method (Equation 3, the mean-based formula) performs best in most cases, with the highest average correlation and no negative correlation.
- For most data sets, the amount of training data can be significantly reduced without affecting the classification accuracy.
- The quality of the data set is closely related to its compressibility. High-quality data sets can improve efficiency by deleting redundant data without affecting performance.
In summary, this paper aims to provide new perspectives and tools for processing high-dimensional, unstructured data by defining class density and data set quality, so that data sets can be used more efficiently and models trained more effectively.