Functional Validation of Transcription Factor to Gene Interactions by Statistical Learning of Gaussian Bayesian Networks from SNP and Expression Data
Jing Xiang
Abstract: Understanding gene regulation is an important step toward understanding how essential mechanisms are controlled in biological systems. One of the central goals of biology is to identify which transcription factors (TFs) regulate the transcription of which target genes, and what their downstream effects are. Functional assays such as ChIP-seq and DNase I together can provide a map of TF binding sites on DNA. However, binding alone may not change the expression of the target gene. Thus, functional validation is necessary to show that the binding influences target gene expression. The standard approach to functional validation is to perform artificial TF knockdown experiments and declare the differentially expressed genes to be validated target genes. Instead of artificial perturbation, we propose to leverage naturally occurring genetic variation as the source of perturbations that vary gene expression, and to analyze population SNP and gene expression data in order to validate the TF binding map. Compared to the standard approach, which perturbs the concentration of a single TF at a time, our approach is potentially more powerful, because any aspect of the TF-target interaction, including TF concentration and TF binding affinity, can be perturbed by a large number of SNPs found across the genome. In addition, we are able to leverage existing SNP and gene expression data, which are available from the popular expression quantitative trait locus (eQTL) mapping studies. We introduce a statistical approach, based on conditional Gaussian Bayesian networks, that integrates population SNP and gene expression data with TF binding data to validate the TF binding map. We develop an efficient algorithm for learning the gene regulatory network that uses the TF binding data as prior knowledge and selects the TF-target interactions that are validated by the population SNP and gene expression data.
Given the estimated network, we perform inference on the estimated probabilistic graphical model to determine the downstream genes that are affected by the TF-target interactions. We demonstrate our method on ENCODE ChIP-seq and DNase I data, and on population SNP and expression data from lymphoblastoid cells, originally collected for the 1000 Genomes and HapMap 3 projects, respectively. Finally, we apply our approach to validate the TF binding map of ER and its coregulators in breast cancer using ENCODE ChIP-seq and DNase I data, and population SNP and expression data from the TCGA project.
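To make the two steps in the abstract concrete (selecting binding-supported edges using SNP and expression data, then inferring downstream effects), here is a minimal sketch on simulated data. It is not the author's learning algorithm: it substitutes a per-target least-squares fit with a t-statistic cutoff, and the function names `validate_edges` and `total_effects`, the threshold `t_thresh`, and the toy data are all assumptions made for illustration. The downstream-effect step relies on a standard property of linear Gaussian networks on a DAG: total effects sum over directed paths, which equals (I - B)^{-1} - I.

```python
import numpy as np

def validate_edges(expr, snps, prior, t_thresh=3.0):
    """Keep binding-supported edges that the expression data also support.

    expr  : (n, g) expression matrix (samples x genes)
    snps  : (n, s) genotype matrix used as covariates
    prior : (g, g) boolean, prior[i, j] is True if binding data suggest
            that gene i (a TF) binds near gene j
    Returns B with B[i, j] = fitted effect of i on j for retained edges.
    (Hypothetical helper; a simple stand-in for the paper's algorithm.)
    """
    n, g = expr.shape
    B = np.zeros((g, g))
    for j in range(g):
        parents = np.flatnonzero(prior[:, j])
        if parents.size == 0:
            continue
        # Design matrix: candidate TF parents, all SNPs, intercept.
        X = np.column_stack([expr[:, parents], snps, np.ones(n)])
        y = expr[:, j]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        sigma2 = resid @ resid / (n - X.shape[1])
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
        t = coef / se
        for k, p in enumerate(parents):
            if abs(t[k]) > t_thresh:
                B[p, j] = coef[k]
    return B

def total_effects(B):
    """Total (direct plus indirect) effects in a linear Gaussian DAG."""
    g = B.shape[0]
    return np.linalg.inv(np.eye(g) - B) - np.eye(g)

# Toy data: SNP 0 perturbs gene 0 (a TF); gene 0 -> gene 1 -> gene 2.
rng = np.random.default_rng(0)
n = 500
snps = rng.integers(0, 3, size=(n, 2)).astype(float)
g0 = 1.0 * snps[:, 0] + rng.normal(size=n)
g1 = 0.8 * g0 + rng.normal(size=n)
g2 = 0.5 * g1 + rng.normal(size=n)
expr = np.column_stack([g0, g1, g2])

# Binding prior proposes 0->1, 1->2, and a non-functional 0->2.
prior = np.zeros((3, 3), dtype=bool)
prior[0, 1] = prior[1, 2] = prior[0, 2] = True

B = validate_edges(expr, snps, prior)
T = total_effects(B)
print("retained edges:\n", np.round(B, 2))
print("total effects:\n", np.round(T, 2))
```

In this toy setup the binding prior restricts which parents each target may have, so the data are only asked to confirm or reject edges that the binding map already proposes. A sparse or Bayesian estimator could replace the t-statistic cutoff without changing the overall structure.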
Computer Science, Biology