iSIM-sigma: efficient standard deviation calculation for molecular similarity

Kenneth Lopez Perez, Bill Zhao, Ramon Alain Miranda Quintana
DOI: https://doi.org/10.1101/2024.11.24.625084
2024-11-26
Abstract: The average and variance of the molecular similarities in a set are valuable information for cheminformatics tasks such as chemical space exploration and subset selection. However, calculating the variance of the complete similarity matrix has quadratic complexity, O(N^2). As molecular libraries keep growing, this pairwise approach becomes infeasible. In this work, we present an alternative that obtains the exact standard deviation of the molecular similarities in a set (with N molecules and M features) for the Russell-Rao (RR) and Sokal-Michener (SM) similarity indexes in O(NM^2) complexity. Additionally, we present a highly accurate approximation with linear complexity, O(N), based on sampling representative molecules from the set. The proposed approximation can be extended to other similarity indexes, including the popular Jaccard-Tanimoto (JT). By sampling only 50 molecules, the proposed method can estimate the standard deviation of the similarities in a set with an RMSE lower than 0.01 for sets of up to 50,000 molecules. In comparison, random sampling does not guarantee a good approximation, as shown in our results.
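To make the O(N^2) versus O(NM^2) contrast concrete, here is a minimal Python sketch for the Russell-Rao case. It is derived only from the standard RR definition (shared on-bits divided by M) and is not taken from the authors' code. The function names, the use of NumPy, and the choice of the population standard deviation over the N(N-1)/2 distinct pairs are all assumptions made for illustration. The key observation is that the sum of squared similarities over all pairs can be read off the M x M feature co-occurrence matrix, which costs O(NM^2) to build.

```python
import numpy as np


def rr_stats_pairwise(X):
    """Reference: mean and std of Russell-Rao similarities over all distinct pairs.

    Builds the full N x N similarity matrix, so the cost grows quadratically in N.
    """
    X = np.asarray(X, dtype=np.int64)
    n, m = X.shape
    upper = np.triu_indices(n, k=1)       # indices of the distinct pairs i < j
    sims = (X @ X.T)[upper] / m           # RR similarity = shared on-bits / M
    return sims.mean(), sims.std()        # population std (ddof=0), an assumption


def rr_stats_from_cooccurrence(X):
    """Same statistics from the M x M co-occurrence matrix, cost O(N M^2).

    Illustrative sketch, not the authors' implementation.
    """
    X = np.asarray(X, dtype=np.int64)
    n, m = X.shape
    n_pairs = n * (n - 1) / 2
    C = X.T @ X                           # C[k, l] = molecules with bits k and l both on
    pairs_sharing = C * (C - 1) / 2       # molecule pairs that share both bits k and l
    # Diagonal terms (k == l) sum to the total number of shared bits over all pairs.
    mean = np.trace(pairs_sharing) / (m * n_pairs)
    # All (k, l) terms sum to the total of squared shared-bit counts over all pairs.
    mean_sq = pairs_sharing.sum() / (m ** 2 * n_pairs)
    var = max(mean_sq - mean ** 2, 0.0)   # guard against tiny negative round-off
    return mean, np.sqrt(var)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fingerprints = rng.integers(0, 2, size=(200, 32))  # toy binary fingerprints
    print(rr_stats_pairwise(fingerprints))
    print(rr_stats_from_cooccurrence(fingerprints))
```

On a toy binary matrix the two functions should agree to floating-point precision. The second never forms the N x N similarity matrix, which is the point of an O(NM^2) formulation when N is much larger than M.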
Bioinformatics