What problem does this paper attempt to address?

This paper addresses how to define and measure class density and data set quality in high-dimensional, unstructured data sets, and how these metrics relate to classification accuracy. Specifically, the author aims to:

1. **Define class density**: Provide a method to measure the aggregated similarity of the samples in each class of a high-dimensional, unstructured data set. To this end, the author proposes calculating class density in the three-dimensional space obtained after UMAP dimensionality reduction.
2. **Evaluate data set quality**: Propose a method to measure the quality of high-dimensional, unstructured data sets, with particular attention to the integrity of the data set. The author found experimentally that when the quality of a data set exceeds a certain threshold (for example, 10), redundant data can be deleted based on class density without significantly affecting classification accuracy.
3. **Study the relationship between class density and classification accuracy**: Analyze the correlation between the density values produced by the different class density calculation methods and the test accuracy of the corresponding classes, to determine whether higher density tends to go with higher accuracy.

### Specific problem description

- **Definition and calculation of data density**: Because distances in high-dimensional data tend toward an approximately uniform distribution (the "curse of dimensionality"), the author uses UMAP to reduce high-dimensional image data to three dimensions and defines class density on that basis. Class density is defined as the aggregated similarity of the samples in each class and is calculated by three candidate methods, based on the minimum, the maximum, and the mean of the standard deviations:

\[ d_{\text{min}}^i = n \cdot c_i \left( \sum_{j = 1}^n c_j \right)^{-1} \cdot \min(\sigma_i)^{-1} \tag{1} \]

\[ d_{\text{max}}^i = n \cdot c_i \left( \sum_{j = 1}^n c_j \right)^{-1} \cdot \max(\sigma_i)^{-1} \tag{2} \]

\[ d_{\text{mean}}^i = n \cdot c_i \left( \sum_{j = 1}^n c_j \right)^{-1} \cdot \left( \frac{1}{m} \sum_{k = 1}^m \sigma_{ik} \right)^{-1} \tag{3} \]

- **Definition of data set quality**: Data set quality is defined as the integrity of the data set, that is, whether the classification accuracy remains unchanged or almost unchanged after a certain amount of data is deleted. The author found experimentally that for 6 data sets, when the quality is greater than 10, the density can be reduced to 1.0 while maintaining an accuracy that is not statistically significantly different.
- **Dynamic data reduction experiment**: Using a dynamic data reduction strategy, the author gradually reduces the number of samples in the training data set until a target density value is reached, then evaluates the classification accuracy after the reduction. The results show that on multiple data sets the amount of training data can be significantly reduced without affecting classification performance.

### Conclusion

Through experiments on different data sets, the author found:

- The mean standard deviation method (Equation 3) performs best in most cases, with the highest average correlation and no negative correlation.
- For most data sets, the amount of training data can be significantly reduced without affecting classification accuracy.
- The quality of a data set is closely related to its compressibility. High-quality data sets can be made more efficient by deleting redundant data without affecting performance.

In summary, this paper aims to provide new perspectives and tools for processing high-dimensional, unstructured data by defining class density and data set quality, so that data sets can be used more efficiently and models trained more effectively.
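
The three class density formulas in the summary can be illustrated with a minimal Python sketch. It is not the author's implementation, and several readings of the notation are assumptions, since the summary does not define the symbols: `n` is taken to be the number of classes, `c_i` the number of samples in class `i`, `σ_ik` the standard deviation of class `i` along embedding dimension `k`, and `m` the number of embedding dimensions (three after UMAP). The population standard deviation is also an assumption, the function name `class_densities` is hypothetical, and the UMAP step itself is omitted: the points are assumed to be already embedded.

```python
from statistics import mean, pstdev


def class_densities(points, labels, reducer=mean):
    """Return {label: density} for points in a low-dimensional embedding.

    points  : list of equal-length coordinate tuples (e.g. 3-D UMAP output)
    labels  : class label of each point, in the same order
    reducer : min, max, or mean, applied to the per-dimension standard
              deviations of a class (one choice per formula in the summary)
    """
    # Group the embedded points by class label.
    by_class = {}
    for point, label in zip(points, labels):
        by_class.setdefault(label, []).append(point)

    n = len(by_class)    # assumption: n is the number of classes
    total = len(points)  # sum of all class sizes c_j

    densities = {}
    for label, pts in by_class.items():
        c_i = len(pts)
        # One standard deviation per embedding dimension (assumed meaning
        # of sigma_ik); population standard deviation is an assumption.
        sigmas = [pstdev(dim) for dim in zip(*pts)]
        # Relative class size times the inverse of the reduced spread.
        # Raises ZeroDivisionError if the reduced spread is exactly zero.
        densities[label] = n * c_i / total / reducer(sigmas)
    return densities


# Toy data: two classes with two points each, in three dimensions.
points = [(0, 0, 0), (2, 4, 8), (0, 0, 0), (4, 4, 4)]
labels = ["a", "a", "b", "b"]

d_min = class_densities(points, labels, min)    # uses min(sigma_i)
d_max = class_densities(points, labels, max)    # uses max(sigma_i)
d_mean = class_densities(points, labels, mean)  # uses the mean of sigma_ik
```

Under these assumptions, the factor `n * c_i / total` equals 1 for a class of exactly average size, so a density of 1.0 corresponds to an average-sized class whose reduced standard deviation is 1 in the embedding.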

Class Density and Dataset Quality in High-Dimensional, Unstructured Data

Assessing Data Quality Within Available Context

Density Peak Clustering with connectivity estimation

Comparative Density Peaks Clustering

Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data

What is the Value of Data? On Mathematical Methods for Data Quality Estimation

Demass: A New Density Estimator for Big Data

A survey on dataset quality in machine learning

Density Estimation Based on Mass

How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces

Statistical Inference from High Dimensional Data

Exploring Dataset-Scale Indicators of Data Quality

Density-ratio Based Clustering for Discovering Clusters with Varying Densities

Classification with many classes: challenges and pluses

Degree-heterogeneous Latent Class Analysis for High-dimensional Discrete Data

Class Granularity: How richly does your knowledge graph represent the real world?

A Density-based Under-sampling Algorithm for Imbalance Classification

A hybrid imbalanced classification model based on data density

On the Accurate Estimation of Information-Theoretic Quantities from Multi-Dimensional Sample Data

Sparse clusterability: testing for cluster structure in high dimensions

High dimensionality: The latest challenge to data analysis