Evaluations on Several Imputation Approaches of Integrated Omics Data

董学思,林丽娟,赵杨,魏永越,戴俊程,陈峰
2017-01-01
Abstract: Objective In the post-GWAS era, integrating data from multiple platforms has become increasingly common. The complexity of the data sources raises new challenges, including how to handle "block missing data". Missing-data imputation must be accurate and precise while preserving the variance-covariance structure of the original data. In this project, we aimed to evaluate the performance of several imputation methods, based on both statistical and machine learning techniques, on integrated data from different platforms. Methods We obtained a lung cancer data set (DNA methylation and gene expression) from The Cancer Genome Atlas (TCGA) and constructed data sets with missing proportions of 5%, 20%, 35%, 50% and 65%. Statistical methods (mean imputation, MCMC) and machine learning methods (kNN, MLP, RF) were applied. Evaluation indicators were estimation bias and the matrix 2-norm. Finally, we compared imputation time to identify a method that is both fast and effective. Results MLP and kNN delivered high imputation quality with low time cost across the different missing ratios. Mean imputation had the shortest filling time, and its imputation quality was high when the missing ratio was low (≤5%); however, its quality declined as the missing ratio increased. At higher missing ratios, RF and MCMC outperformed mean imputation, but both were time-consuming. Conclusion After comprehensive comparison, the machine learning methods MLP and kNN proved to be suitable approaches for joint imputation of DNA methylation and gene expression data.
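
The evaluation procedure described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions, because the abstract does not specify them: entries are masked completely at random (not in blocks), the data are synthetic (not TCGA), bias is taken as the mean difference between imputed and true values on the masked entries, and the matrix 2-norm is taken as the spectral norm of the difference between the original and imputed matrices. Only mean imputation is shown, and the function names are hypothetical.

```python
import numpy as np


def mask_at_random(X, proportion, rng):
    """Return a copy of X with roughly `proportion` of entries set to NaN.

    Assumption: entries are removed completely at random.
    """
    X_missing = X.astype(float).copy()
    mask = rng.random(X.shape) < proportion
    X_missing[mask] = np.nan
    return X_missing, mask


def mean_impute(X_missing):
    """Fill each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_missing, axis=0)
    X_imputed = X_missing.copy()
    rows, cols = np.where(np.isnan(X_missing))
    X_imputed[rows, cols] = col_means[cols]
    return X_imputed


def evaluate(X_true, X_imputed, mask):
    """Return (bias on masked entries, spectral norm of the difference)."""
    bias = float(np.mean(X_imputed[mask] - X_true[mask]))
    norm2 = float(np.linalg.norm(X_true - X_imputed, ord=2))
    return bias, norm2


rng = np.random.default_rng(0)
# Synthetic stand-in for a samples-by-features omics matrix (assumption).
X = rng.normal(size=(200, 30))

results = {}
for p in (0.05, 0.20, 0.35, 0.50, 0.65):
    X_missing, mask = mask_at_random(X, p, rng)
    X_imputed = mean_impute(X_missing)
    results[p] = evaluate(X, X_imputed, mask)
    bias, norm2 = results[p]
    print(f"missing={p:.0%}  bias={bias:+.4f}  2-norm={norm2:.3f}")
```

Other imputers can be compared in the same loop by replacing `mean_impute`, for example with scikit-learn's `KNNImputer`, and wall-clock time can be recorded around the imputation call to mirror the time comparison.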