Probabilistic Lung Cancer Models Conditioned on Gene Expression Microarray Data

Craig Friedman, Wenbo Cao, Cheng Fan
DOI: https://doi.org/10.1007/0-387-23077-7_11
2005-01-01
Abstract: A number of quantitative methods have been applied to the classification and clustering of microarray data (see, for example, [Tibshirani et al., 2001]). In this article, we describe a statistical learning theory-based method to construct lung cancer probability models that are conditioned on gene expression microarray data. Our models do more than classify: they also provide an estimate of the probability of each outcome. We find our estimate of the conditional probability distribution by choosing a model that balances consistency with the training data against consistency with a prior distribution. This formulation leads to an optimization problem that has a mathematically equivalent problem whose objective function is a penalized log-likelihood. We discuss three particular estimation problems: 1) find the conditional probability that a sample is adenocarcinoma or normal, given gene expression levels, 2) find the conditional probability for each of six disjoint categories related to lung cancer, given gene expression levels, and 3) find the conditional probability distribution for survival time, given gene expression levels. We describe the features that we select and measure the performance of the models that we create in economic terms. For the conditional probability of adenocarcinoma, we condition on probeset identifiers common to both the Harvard and Michigan data sets. When we trained on either data set, we were able to classify adenocarcinoma nearly perfectly on the other set.
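To make the idea of a penalized log-likelihood objective concrete, the following is a minimal sketch for the two-class case (for example, adenocarcinoma versus normal). It is not the authors' implementation: the logistic model form, the quadratic penalty (which corresponds to a Gaussian-style prior on the weights), the gradient-ascent optimizer, the synthetic data, and all function and parameter names (`fit_penalized_logistic`, `alpha`, `lr`) are assumptions made purely for illustration.

```python
import numpy as np


def predict_proba(X, w, b):
    """Conditional probability P(y = 1 | x) under a logistic model (assumed form)."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def penalized_log_likelihood(X, y, w, b, alpha):
    """Mean log-likelihood minus a quadratic penalty on the weights.

    The penalty pulls the model toward w = 0, i.e. toward a prior that
    ignores the features; alpha sets the balance between data and prior.
    """
    p = np.clip(predict_proba(X, w, b), 1e-12, 1.0 - 1e-12)
    log_lik = np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return log_lik - 0.5 * alpha * np.dot(w, w)


def fit_penalized_logistic(X, y, alpha=0.1, lr=0.1, n_iter=2000):
    """Maximize the penalized log-likelihood by plain gradient ascent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = predict_proba(X, w, b)
        w += lr * (X.T @ (y - p) / n - alpha * w)  # gradient w.r.t. weights
        b += lr * np.mean(y - p)                   # gradient w.r.t. intercept
    return w, b


# Synthetic stand-in for expression features (NOT real microarray data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

alpha = 0.1
w, b = fit_penalized_logistic(X, y, alpha=alpha)
probs = predict_proba(X, w, b)  # probabilities, not just class labels
```

Larger values of `alpha` keep the fitted model closer to the prior, while smaller values let it follow the training data more closely. The output is a probability for each sample rather than a hard class label, which is the distinction the abstract draws.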