On Selecting Distance Metrics in $n$-Dimensional Normed Vector Spaces of Cells: A Novel Criterion and Similarity Measure Towards Efficient and Accurate Omics Analysis
Okezue Bell, Arthur Lee, Elizabeth Engle
DOI: https://doi.org/10.48550/arXiv.2306.09243
2024-06-05
Abstract: Single-cell omics make it possible to quantify the profiles of individual cells, each of which contains a large number of biological features. Cluster analysis, used here as a dimensionality reduction process, reduces the dimensions of the data to make it computationally tractable. In these analyses, cells are represented as vectors in $n$-dimensional space, where each dimension corresponds to a particular cell feature. The distance between cells is used as a surrogate measure of similarity, providing insight into each cell's state, function, and genetic mechanisms. However, because cell profiles are clustered in 3D or higher-dimensional space, it remains unknown which distance metric provides the most accurate spatiotemporal representation of similarity, which limits the interpretability of the data. I propose and prove a generalized proposition and a set of corollaries that together serve as a criterion for determining which of the standard distance measures most accurately conveys cell profile heterogeneity. Each distance method is evaluated via statistical, geometric, and topological proofs, which are formalized into a set of criteria. In this paper, I present what is putatively the first method for selecting the most accurate and the most precise distance metric for any profiling modality; in general cases, these are determined to be the Wasserstein distance and cosine similarity, respectively. I also identify special cases in which the criterion may select non-standard metrics. Combining the metric properties selected by the criterion, I develop a novel, custom, optimal distance metric that demonstrates superior computational efficiency, peak annotation, motif identification, and footprinting for transcription factor binding sites when compared with leading methods.
Genomics, Differential Geometry
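
The abstract describes cells as vectors in $n$-dimensional feature space and names the Wasserstein distance and cosine similarity as the metrics its criterion selects in general cases. The following is a minimal sketch of how those two standard measures, plus the Euclidean distance for comparison, might be computed for a pair of cell profiles. It is not the authors' criterion or their custom metric. The function names and the toy vectors `cell_a` and `cell_b` are hypothetical, and treating a cell's feature values as an equal-weight empirical distribution for the Wasserstein computation is an assumption made purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def wasserstein_1d(u, v):
    """1-Wasserstein distance between two equal-size, equal-weight 1D samples.

    Assumption: each cell's feature values are treated as an empirical
    distribution. For equal-size samples, the distance is the mean absolute
    difference between the sorted values.
    """
    if len(u) != len(v):
        raise ValueError("this sketch only handles samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def euclidean_distance(u, v):
    """Straight-line (L2) distance between two feature vectors."""
    return math.dist(u, v)


# Hypothetical toy profiles: three features per cell.
cell_a = [1.0, 0.0, 2.0]
cell_b = [2.0, 0.0, 4.0]

cos_ab = cosine_similarity(cell_a, cell_b)
w1_ab = wasserstein_1d(cell_a, cell_b)
l2_ab = euclidean_distance(cell_a, cell_b)
print(cos_ab, w1_ab, l2_ab)
```

In this toy case `cell_b` is a scaled copy of `cell_a`, so the cosine similarity is approximately 1 while the Euclidean and Wasserstein distances are nonzero. This illustrates why the choice of metric changes which cells count as similar.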