Usage of Clustering and Weighted Nearest Neighbors for Efficient Missing Data Imputation of Microarray Gene Expression Dataset

Aditya Dubey, Akhtar Rasool
DOI: https://doi.org/10.1002/adts.202200460
2022-08-30
Advanced Theory and Simulations
Abstract: A complete dataset is essential for most bioinformatics analytical techniques, including gene expression data categorization, prognosis, and prediction. Gene sample values may be missing because of sensor malfunction, software failure, or human error, and in gene expression experiments missing data has a massive effect on the analysis of the data obtained. Consequently, this has become a crucial issue that requires an efficient imputation technique. This research provides a technique for predicting missing values by using clustering and top K nearest neighbor techniques that consider the local similarity pattern. The K‐means method is integrated with a spectral clustering methodology. After optimizing the clustering parameters, cluster size, and weighting criteria, missing gene sample values are estimated. The top K nearest neighbor method uses weighted distance to predict the missing gene sample value falling in a specific cluster. Experimental outcomes show that the suggested imputation methodology generates efficient predictions compared to existing imputation techniques. In this research, microarray datasets comprising information from various cancers and tumors are used to evaluate the imputation performance. The primary contribution of this work is that even if the microarray dataset has varied dimensions and features, local similarity‐based approaches may be employed for missing value prediction.
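The cluster-restricted, weighted nearest neighbor step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `weighted_knn_impute` is hypothetical, the cluster labels are assumed to come from a prior clustering step (such as the K‐means and spectral clustering combination the abstract mentions), and both the inverse-distance weighting and the root-mean-square distance over co-observed columns are assumptions, since the abstract only states that a weighted distance is used.

```python
import math

def weighted_knn_impute(rows, labels, k=2, eps=1e-9):
    """Fill None entries using the k nearest rows from the same cluster.

    rows   : list of expression profiles (lists of float or None)
    labels : cluster id per row, assumed to come from a prior clustering step
    Returns a new list of rows; the input is not modified.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbor must share the cluster and have column j observed.
                if m == i or labels[m] != labels[i] or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                # Root-mean-square distance over co-observed columns (assumed metric).
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # no usable neighbor in this cluster; leave the gap
            # Inverse-distance weights (assumed scheme): closer rows count more.
            weights = [1.0 / (d + eps) for d, _ in nearest]
            imputed[i][j] = (
                sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
            )
    return imputed

# Toy data: three profiles in cluster 0, one outlier profile in cluster 1.
rows = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 5.0],
    [10.0, 10.0, 100.0],
]
labels = [0, 0, 0, 1]
filled = weighted_knn_impute(rows, labels, k=2)
print(filled[0][2])
```

In this toy example the missing entry is estimated only from the two rows that share its cluster, so the outlier profile in cluster 1 has no influence on the result. Restricting the neighbor search to one cluster is what makes the estimate depend on local similarity rather than on the whole dataset.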