Learning Statistical and Geometric Models from Microarray Gene Expression Data

Yangyong Zhu
2009-01-01
Abstract: Analysis of microarray gene expression data is important for disease study at the molecular and genomic level. Computational data modeling and analysis are essential for extracting meaningful and specific information from noisy, high-throughput, and large-scale microarray gene expression data. In this dissertation, we propose and develop innovative data modeling and analysis methods for learning statistical and geometric models from gene expression data and subsequently discovering data structure and information associated with disease mechanisms. To provide a high-level overview of gene expression data for easy and insightful understanding of data structure relevant to the physiological event of interest, we propose a novel statistical data clustering and visualization algorithm that is comprehensive and effective for multiple clustering tasks and that overcomes some of the major limitations associated with existing clustering methods. The proposed clustering and visualization algorithm performs progressive, divisive hierarchical clustering and visualization, supported by hierarchical statistical modeling, supervised/unsupervised informative gene/feature selection, supervised/unsupervised data visualization, and user/prior knowledge guidance through human-data interactions, to discover cluster structure within complex, high-dimensional gene expression data. Applications to muscular dystrophy, muscle regeneration, and cancer data demonstrated its ability to identify functionally enriched (co-regulated) gene groups, detect/validate disease types/subtypes, and discover the pathological relationship among multiple disease types reflected by gene expression profiles. For the purpose of selecting suitable clustering algorithm(s) for gene expression data analysis and validating the advantage of our proposed clustering algorithm, we design an