ClusterGraph: a new tool for visualization and compression of multidimensional data

Paweł Dłotko, Davide Gurnari, Mathis Hallier, Anna Jurek-Loughrey
2024-11-08
Abstract: Understanding the global organization of complicated and high-dimensional data is of primary interest for many branches of applied sciences. It is typically achieved by applying dimensionality reduction techniques that map the considered data into a lower-dimensional space. This family of methods, while preserving local structures and features, often misses the global structure of the dataset. Clustering techniques are another class of methods operating on the data in the ambient space. They group together points that are similar according to a fixed similarity criterion; however, unlike dimensionality reduction techniques, they do not provide information about the global organization of the data. Leveraging ideas from Topological Data Analysis, in this paper we provide an additional layer on the output of any clustering algorithm. This data structure, ClusterGraph, provides information about the global layout of the clusters obtained from the considered clustering algorithm. Appropriate measures are provided to assess the quality and usefulness of the obtained representation. Subsequently, the ClusterGraph, possibly with an appropriate structure-preserving simplification, can be visualized and used in synergy with state-of-the-art exploratory data analysis techniques.
Computational Geometry, Algebraic Topology, Machine Learning
What problem does this paper attempt to address?
This paper addresses the problem of understanding the global organizational structure of high-dimensional data. It proposes a new tool, ClusterGraph, for the visualization and compression of such data. Specifically, the paper targets the following problems:

1. **Global organizational structure of high-dimensional data**:
   - Existing dimensionality reduction techniques (such as PCA, t-SNE, and UMAP) can preserve local structures and features, but they often miss the global structure of the dataset.
   - Clustering techniques (such as k-means and DBSCAN) group similar data points, but they provide no information about how points are organized within or between clusters, so they do not reveal the global structure of the data either.
2. **Limitations of dimensionality reduction techniques**:
   - Many high-dimensional datasets cannot be embedded in a low-dimensional Euclidean space without distortion. For example, distances between far-apart points in some datasets are distorted after dimensionality reduction.
   - ClusterGraph avoids this distortion by computing the distances between clusters in the original high-dimensional space and recording them as edge labels.
3. **Supplementing clustering results with global structure**:
   - The paper proposes ClusterGraph as an additional layer on the output of any clustering algorithm, providing information about the global layout of the clusters.
   - ClusterGraph shows how clusters are connected and, with an appropriate simplification, still preserves the structure of the data, so it can be used in synergy with existing exploratory data analysis techniques.
4. **Evaluating and optimizing the quality of ClusterGraph**:
   - A measure based on metric distortion is proposed to evaluate the quality of a ClusterGraph.
   - A logarithmic ratio is used as the quality metric; a smaller value indicates that the ClusterGraph represents the data more faithfully.
   - The ClusterGraph is further optimized with an edge-pruning algorithm so that it better approximates the intrinsic structure of the data.

### Summary

The core problem of the paper is to develop a new tool, ClusterGraph, that compensates for the shortcomings of existing dimensionality reduction and clustering techniques, so that the global organizational structure of high-dimensional data can be understood and visualized more comprehensively. ClusterGraph builds a graph on top of the clustering results and uses metric distortion to evaluate and optimize its quality, providing a new perspective and method for high-dimensional data analysis.
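
The ideas of edge labels computed in the original space (point 2) and a log-ratio quality measure (point 4) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the average pairwise distance between clusters is an assumed choice of cluster distance, and the reference distances used in the distortion are assumed to be the direct cluster-to-cluster distances. Removing one edge by hand stands in for the paper's edge-pruning algorithm, which is not reproduced here.

```python
import math
from itertools import combinations

import numpy as np


def cluster_distance(points_a, points_b):
    """Average pairwise Euclidean distance between two clusters (assumed choice)."""
    diffs = points_a[:, None, :] - points_b[None, :, :]
    return float(np.linalg.norm(diffs, axis=2).mean())


def build_cluster_graph(points, labels):
    """Return the cluster ids and the weight matrix of the complete graph on clusters."""
    ids = sorted(set(labels.tolist()))
    n = len(ids)
    weights = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        d = cluster_distance(points[labels == ids[i]], points[labels == ids[j]])
        weights[i, j] = weights[j, i] = d
    return ids, weights


def shortest_paths(weights):
    """Floyd-Warshall all-pairs shortest paths; np.inf marks a removed edge."""
    dist = weights.copy()
    for k in range(dist.shape[0]):
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist


def metric_distortion(graph_dist, reference_dist):
    """Mean absolute log-ratio of graph distances to reference distances (assumed form)."""
    n = graph_dist.shape[0]
    ratios = [abs(math.log(graph_dist[i, j] / reference_dist[i, j]))
              for i, j in combinations(range(n), 2)]
    return sum(ratios) / len(ratios)


# Toy data: three small clusters arranged along a line in the plane.
points = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])

ids, full = build_cluster_graph(points, labels)

# Complete graph: shortest paths equal the direct distances, so no distortion.
d_full = metric_distortion(shortest_paths(full), full)

# Remove the edge between clusters 0 and 2; the path now goes through cluster 1.
pruned = full.copy()
pruned[0, 2] = pruned[2, 0] = np.inf
d_pruned = metric_distortion(shortest_paths(pruned), full)

print(f"distortion, complete graph: {d_full:.5f}")
print(f"distortion, edge 0-2 removed: {d_pruned:.5f}")
```

In this toy case the pruned graph keeps only the two edges along the line, and its distortion stays close to zero because the detour through the middle cluster is almost as short as the removed edge. This illustrates why a sparser graph can still be a faithful summary of the global layout.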