DADApy: Distance-based Analysis of DAta-manifolds in Python

Aldo Glielmo, Iuri Macocco, Diego Doimo, Matteo Carli, Claudio Zeni, Romina Wild, Maria d'Errico, Alex Rodriguez, Alessandro Laio
DOI: https://doi.org/10.1016/j.patter.2022.100589
2022-09-20
Abstract: DADApy is a Python software package for analysing and characterising high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering, and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in toy cases and in a real-world application. DADApy is freely available under the open-source Apache 2.0 license.
Machine Learning, Computational Physics
### What problems does this paper attempt to solve?

This paper introduces **DADApy**, a Python software package for analyzing and characterizing high-dimensional data manifolds. It addresses four key problems:

1. **Intrinsic Dimension Estimation**: In high-dimensional data, the relevant information usually lies on a low-dimensional manifold. DADApy provides several distance-based methods for estimating the intrinsic dimension of this manifold, such as the Two Nearest Neighbours (2NN) estimator.
2. **Density Estimation**: DADApy implements a non-parametric density estimator, Point-adaptive kNN (PAk), which reconstructs the probability density function \(\rho(x)\) from the data. The method is especially suited to data embedded in low-dimensional manifolds and can significantly improve estimation performance in complex scenarios.
3. **Density-based Clustering**: The package implements the Density Peaks (DP) and Advanced Density Peaks (ADP) clustering algorithms. Both divide the data set into clusters by identifying the density peaks on the data manifold. ADP additionally uses a statistical significance analysis to select automatically which density peaks serve as cluster centers.
4. **Metric Comparisons**: In many applications, similarity or distance can be measured with different metrics. DADApy provides two methods for evaluating the relationship between metrics: Neighbourhood Overlap and Information Imbalance. These methods can help users select the feature subset best suited to describing the data manifold.
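As an illustration of the first item, the idea behind the 2NN estimator can be sketched in a few lines. For each point, the ratio \(\mu = r_2 / r_1\) of the distances to its second and first nearest neighbours follows, under a locally uniform density, a Pareto distribution with exponent equal to the intrinsic dimension \(d\). The sketch below is not DADApy's implementation: it uses only the Python standard library, a brute-force neighbour search, and a maximum-likelihood estimate \(d = N / \sum_i \log \mu_i\) in place of a fit to the empirical cumulative distribution. The function names and the toy data set are hypothetical.

```python
import math
import random


def two_nn_ratios(points):
    """Return mu_i = r2 / r1 for every point (brute force, O(N^2))."""
    ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf  # smallest and second-smallest distances so far
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip duplicated points, whose ratio is undefined
            ratios.append(r2 / r1)
    return ratios


def two_nn_id(points):
    """Maximum-likelihood intrinsic dimension from the 2NN distance ratios."""
    ratios = two_nn_ratios(points)
    return len(ratios) / sum(math.log(mu) for mu in ratios)


def embedded_sheet(n, seed=0):
    """Toy data (an assumption for this sketch): a 2D sheet linearly embedded in 5D."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        points.append((u, v, u + v, u - v, 0.5 * u))
    return points


if __name__ == "__main__":
    data = embedded_sheet(500)
    print(f"estimated intrinsic dimension: {two_nn_id(data):.2f}")
```

Although every point has five coordinates, the estimate should come out close to 2, because only the two nearest-neighbour distances of each point enter the calculation.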
### Summary

DADApy targets core challenges in high-dimensional data analysis: estimating the intrinsic dimension of the data, reconstructing the probability density, performing density-based clustering, and comparing different distance metrics. These functions make DADApy a useful tool for handling complex high-dimensional data, with potential applications in fields such as computational science and biomedicine.

### Example Application Scenarios

- **Synthetic Data Set**: The paper applies DADApy to a synthetic data set with a complex topology: a two-dimensional plane twisted into a Möbius strip and embedded in a 50-dimensional noise space. The results show that DADApy can accurately estimate the intrinsic dimension, reconstruct the density, and identify the correct clusters.
- **Real-World Application**: The paper also applies DADApy to the analysis of biomolecular trajectories, further demonstrating its effectiveness on real data.

Through these methods, DADApy gives researchers a flexible tool for uncovering hidden structures and patterns in high-dimensional data.
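To make the metric comparison concrete, here is a minimal sketch of the Information Imbalance between two distance measures A and B, taken as \(\Delta(A \to B) = \frac{2}{N} \langle r^B \mid r^A = 1 \rangle\): the average rank, according to B, of each point's nearest neighbour according to A, rescaled by \(2/N\). Values near 0 suggest that A predicts the neighbourhoods of B well, values near 1 that it carries little information about B. This is a brute-force, standard-library illustration rather than DADApy's implementation. The function name and the toy data are hypothetical.

```python
import math
import random


def information_imbalance(space_a, space_b):
    """Estimate Delta(A -> B) = (2 / N) * <rank in B of the nearest neighbour in A>."""
    n = len(space_a)
    total_rank = 0
    for i in range(n):
        # Nearest neighbour of point i according to distance A.
        j = min((k for k in range(n) if k != i),
                key=lambda k: math.dist(space_a[i], space_a[k]))
        # Rank of that same neighbour according to distance B (1 = nearest).
        d_ij = math.dist(space_b[i], space_b[j])
        closer = sum(1 for k in range(n)
                     if k != i and k != j
                     and math.dist(space_b[i], space_b[k]) < d_ij)
        total_rank += 1 + closer
    return 2.0 * total_rank / (n * n)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data (an assumption for this sketch): points in the unit square,
    # described either by both coordinates or by the x coordinate alone.
    full = [(rng.random(), rng.random()) for _ in range(300)]
    x_only = [(x,) for x, _ in full]
    print("Delta(full -> x):", round(information_imbalance(full, x_only), 3))
    print("Delta(x -> full):", round(information_imbalance(x_only, full), 3))
```

In this toy case the full description should predict neighbourhoods in the x-only space far better than the reverse, so the first value is expected to be clearly smaller than the second. This asymmetry is what makes the measure useful for ranking feature subsets.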