Topological Quality of Subsets via Persistence Matching Diagrams

Álvaro Torras-Casas, Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz

2024-09-29

Abstract: Data quality is crucial for the successful training, generalization and performance of machine learning models. We propose to measure the quality of a subset concerning the dataset it represents, using topological data analysis techniques. Specifically, we define the persistence matching diagram, a topological invariant derived from combining embeddings with persistent homology. We provide an algorithm to compute it using minimum spanning trees. Also, the invariant allows us to understand whether the subset "represents well" the clusters from the larger dataset or not, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.

Algebraic Topology, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses how to evaluate the quality of a subset (e.g., a training set) relative to the complete dataset it is drawn from, particularly with respect to the training, generalization, and performance of machine learning models. The authors propose a method based on topological data analysis: they define a persistence matching diagram and use it to measure the quality of the subset. The method indicates whether the subset represents the clusters of the larger dataset well, and it also yields estimated bounds for the Hausdorff distance between the subset and the complete dataset. This helps explain why a given subset is likely to lead to poor performance of a supervised learning model.

The main contributions of the paper are:

1. **Definition of the Persistence Matching Diagram**: A topological invariant, derived from combining embeddings with persistent homology, that is used to assess how representative the subset is.
2. **Algorithm Implementation**: An algorithm for computing the persistence matching diagram using minimum spanning trees.
3. **Quality Assessment**: The persistence matching diagram is used to determine whether the subset represents the clusters of the larger dataset well and to estimate bounds for the Hausdorff distance between the subset and the complete dataset.
4. **Application Examples**: Two experiments demonstrate the method on real machine learning problems.

These contributions improve the understanding of training data quality and of its effect on the performance of machine learning models.
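The summary above relies on two standard ingredients: the Hausdorff distance between a subset and the complete dataset, and minimum spanning trees. The sketch below illustrates both on a toy point cloud. It is not the authors' algorithm for the persistence matching diagram; the function names, the toy data, and the use of the Euclidean metric are assumptions made only for illustration. It uses the well-known fact that, for a Vietoris-Rips filtration, the finite death values of 0-dimensional persistent homology coincide with the minimum spanning tree edge lengths (up to the scaling convention chosen for the filtration).

```python
import math


def hausdorff_subset(subset, dataset):
    """Hausdorff distance between a subset and the dataset containing it.

    Every subset point lies in the dataset, so only the directed distance
    from the dataset to the subset can be nonzero.
    """
    return max(min(math.dist(x, s) for s in subset) for x in dataset)


def mst_edge_lengths(points):
    """Sorted edge lengths of a Euclidean minimum spanning tree (Prim)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each point to the tree
    best[0] = 0.0
    lengths = []
    for step in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the first point starts the tree and adds no edge
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(lengths)


# Hypothetical toy dataset with two well-separated clusters.
dataset = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
subset_good = [(0, 0), (10, 10)]  # one point from each cluster
subset_bad = [(0, 0), (1, 0)]     # misses the second cluster entirely

print(hausdorff_subset(subset_good, dataset))  # small: every point is near the subset
print(hausdorff_subset(subset_bad, dataset))   # large: a whole cluster is far away
print(mst_edge_lengths(dataset))               # one long edge bridges the two clusters
```

In this toy example the subset that skips a cluster has a much larger Hausdorff distance to the dataset, and the single long minimum spanning tree edge of the dataset reflects the gap between the two clusters. The paper's invariant combines this kind of information for the subset and the dataset in one diagram.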

Topological Machine Learning with Persistence Indicator Functions

Perturbation Robust Representations of Topological Persistence Diagrams

Optimal rates of convergence for persistence diagrams in Topological Data Analysis

Discrete transforms of quantized persistence diagrams

Topological and metric properties of spaces of generalized persistence diagrams

Properties and Stability of Persistence Matching Diagrams

Visualizing Topological Importance: A Class-Driven Approach

Learning Persistent Homology of 3D Point Clouds

Understanding the Topology and the Geometry of the Space of Persistence Diagrams via Optimal Partial Transport

Computational Topology and Its Applications in Geometric Design

Persistent interaction topology in data analysis

Persistent Topological Features in Large Language Models

A Comparative Study of Machine Learning Methods for Persistence Diagrams

Wasserstein convergence of Čech persistence diagrams for samplings of submanifolds

Topo-Geometric Analysis of Variability in Point Clouds using Persistence Landscapes

Towards a Persistence Diagram That is Robust to Noise and Varied Densities

A Class of Topological Pseudodistances for Fast Comparison of Persistence Diagrams

Comparing representations of high-dimensional data with persistent homology: a case study in neuroimaging

The Accumulated Persistence Function, a New Useful Functional Summary Statistic for Topological Data Analysis, With a View to Brain Artery Trees and Spatial Point Process Applications

TDAvec: Computing Vector Summaries of Persistence Diagrams for Topological Data Analysis in R and Python