Clustering Algorithm Based on Dual-Index Nearest Neighbor Similarity Measure and Its Application in Gene Expression Data Analysis
Zongjin Li, ChangXin Song, Jiyu Yang, Zeyu Jia, Chengying Yan, Liqin Tian, Xiaoming Wu
DOI: https://doi.org/10.21203/rs.3.rs-2641728/v1
2023-01-01
Abstract: Background: A critical step in analyzing gene expression data is dividing genes into co-expression modules using module detection methods, and clustering algorithms are the most commonly employed technique for this task. Obtaining gene modules with biological significance depends heavily on the choice of similarity measure. However, commonly used similarity measures may not fully capture the complexity of biological systems, so exploring more informative similarity measures before partitioning genes into co-expression modules remains important. Results: We propose the Dual-Index Nearest Neighbor Similarity Measure (DINNSM) algorithm to address this issue. The algorithm first calculates the similarity matrix between genes using Pearson or Spearman correlation. Nearest neighbor measurements are then constructed from this similarity matrix, and finally the similarity matrix is reconstructed. We tested six similarity measures (Pearson correlation, Spearman correlation, Euclidean distance, maximal information coefficient, distance correlation, and DINNSM) with four clustering algorithms (K-means, hierarchical clustering, fuzzy C-means (FCM), and weighted gene co-expression network analysis (WGCNA)) on three independent gene expression datasets. Cluster quality was evaluated with four indices: the Silhouette index, the Calinski-Harabasz index, the adjusted biological homogeneity index, and the Davies-Bouldin index. The results showed that DINNSM is accurate and yields biologically meaningful gene co-expression modules. Conclusions: DINNSM is better at revealing the complex biological relationships between genes and helps to obtain more accurate and biologically meaningful gene co-expression modules.
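
The three steps named in the abstract (correlation matrix, nearest neighbor measurements, reconstructed similarity matrix) can be illustrated with a minimal sketch. The abstract does not define the two indices used by DINNSM, so the code below is not the authors' method: it substitutes a generic shared-nearest-neighbor reweighting, in which each Pearson similarity is scaled by the Jaccard overlap of the two genes' k-nearest-neighbor sets. The function names, the parameter `k`, and the Jaccard weighting are all assumptions made for illustration.

```python
import numpy as np


def pearson_similarity(expr):
    """Gene-by-gene Pearson correlation; each row of expr is one gene."""
    # Assumes no gene has constant expression (that would produce NaN).
    return np.corrcoef(expr)


def knn_indices(sim, k):
    """For each gene, indices of its k most similar other genes."""
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)  # a gene is never its own neighbor
    return np.argsort(-s, axis=1)[:, :k]


def neighbor_reweighted_similarity(sim, k):
    """Hypothetical stand-in for the reconstruction step (not DINNSM itself).

    Scales sim[i, j] by the Jaccard overlap of the k-nearest-neighbor
    sets of genes i and j.
    """
    n = sim.shape[0]
    nn = knn_indices(sim, k)
    # member[i, g] is True when gene g is among the k neighbors of gene i.
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], nn] = True
    m = member.astype(int)
    shared = m @ m.T            # size of the intersection of neighbor sets
    union = 2 * k - shared      # every set has exactly k members
    out = sim * (shared / union)
    np.fill_diagonal(out, 1.0)
    return out


# Usage on a toy matrix of 6 genes x 40 samples forming two modules.
t = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
a, b = np.sin(t), np.cos(t)
c, d = np.sin(3 * t), np.cos(3 * t)
expr = np.vstack([a, a + 0.1 * c, a + 0.1 * d,
                  b, b + 0.1 * c, b + 0.1 * d])
sim = pearson_similarity(expr)
new_sim = neighbor_reweighted_similarity(sim, k=2)
```

In this sketch, genes that share no nearest neighbors receive a reconstructed similarity of zero, which sharpens module boundaries before the matrix is passed to a clustering algorithm. Spearman correlation could replace the first step by ranking each row of `expr` before calling `np.corrcoef`.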