Topological Quality of Subsets via Persistence Matching Diagrams

Álvaro Torras-Casas, Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz
2024-09-29
Abstract: Data quality is crucial for the successful training, generalization and performance of machine learning models. We propose to measure the quality of a subset concerning the dataset it represents, using topological data analysis techniques. Specifically, we define the persistence matching diagram, a topological invariant derived from combining embeddings with persistent homology. We provide an algorithm to compute it using minimum spanning trees. Also, the invariant allows us to understand whether the subset "represents well" the clusters from the larger dataset or not, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
Algebraic Topology, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to evaluate the quality of a subset (e.g., a training set) relative to the complete dataset it is drawn from, with a view to the training, generalization, and performance of machine learning models. The authors propose a method based on topological data analysis: they define a persistence matching diagram and use it to measure the quality of the subset. The method indicates whether the subset represents the clusters of the larger dataset "well", and it also yields estimated bounds for the Hausdorff distance between the subset and the complete dataset. This helps explain why a given subset is likely to lead to poor performance of a supervised learning model.

The main contributions of the paper are:

1. **Definition of the Persistence Matching Diagram**: A topological invariant derived from combining embeddings with persistent homology, used to assess how representative the subset is.
2. **Algorithm Implementation**: An algorithm that computes the persistence matching diagram using minimum spanning trees.
3. **Quality Assessment**: The persistence matching diagram is used to determine whether the subset represents the clusters of the larger dataset "well" and to estimate bounds for the Hausdorff distance between the subset and the complete dataset.
4. **Application Examples**: Two experiments are presented to demonstrate the method on real machine learning problems.

Together, these contributions improve the understanding of training data quality, which in turn can inform the choice of training subsets.
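
The two computational ingredients named above, minimum spanning trees and the Hausdorff distance, can be illustrated with a small sketch. This is not the paper's persistence matching diagram algorithm, which is not reproduced here. It only shows, under stated assumptions, how the sorted edge lengths of a Euclidean minimum spanning tree record the scales at which clusters merge (these correspond to the 0-dimensional persistence values of a Vietoris-Rips filtration, up to the scale convention), and how the Hausdorff distance between a subset and the full dataset can be computed directly. The function names `mst_edge_lengths` and `hausdorff_subset` and the toy point sets are hypothetical and chosen for illustration only.

```python
import math


def mst_edge_lengths(points):
    """Sorted edge lengths of a Euclidean minimum spanning tree.

    Uses Prim's algorithm on the complete graph, O(n^2) time.
    Illustrative helper, not the paper's algorithm.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each point to the tree
    best[0] = 0.0
    lengths = []
    for step in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the starting point is not attached by any edge
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(lengths)


def hausdorff_subset(subset, dataset):
    """Hausdorff distance between a subset and the dataset containing it.

    Assumes every point of `subset` also lies in `dataset`, so only the
    direction dataset -> subset can be non-zero.
    """
    return max(min(math.dist(x, s) for s in subset) for x in dataset)


# Toy data (assumed for illustration): two well-separated clusters.
X = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1)]
S_good = [(0, 0), (10, 0)]  # one representative per cluster
S_bad = [(0, 0), (1, 0)]    # misses the second cluster entirely

print(mst_edge_lengths(X))          # one long edge separates the two clusters
print(hausdorff_subset(S_good, X))  # small: every point is near a representative
print(hausdorff_subset(S_bad, X))   # large: the second cluster is far from S_bad
```

In this toy example the subset that skips a whole cluster has a much larger Hausdorff distance to the dataset, which is the kind of poor representation the summary says the persistence matching diagram is meant to detect.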