Ensemble Machine Methods for Analysis of Transcription Factor and DNA Interactions

Yue Fan, Mark Kon, Charles DeLisi
2010-01-01
Abstract: Motivation: The network of interactions between transcription factors and their regulatory gene targets governs many of the behaviors and responses of cells. The construction of the regulatory network has been decomposed into identifying, for every known regulator, its target genes, its binding motif and its DNA binding sites. Many tools have been developed in the last decade to solve these problems. However, Tompa et al. (2005) showed that no individual algorithm performed consistently well across all transcription factors. Because machine learning algorithms have shown an advantage in integrating information of different types, we exploit this property to integrate the predictions from an ensemble of commonly used motif exploration algorithms. Results: In this paper, we introduce 3 ensemble machine methods that integrate the predictions from 5 commonly used motif exploration algorithms. Besides the conventional PWM scanning model, we developed a w-scanning model, which uses the feature importance measure derived from a binary classification model to identify significant (overrepresented) k-mers and potential binding sites across the entire genome. With both rescanning models, the comprehensive ensemble machine provides a better alternative to the conventional PWM model in DNA binding analysis. We also introduce the definition of PWM k-mer subspaces, which provide a dimension reduction tool for sequence analysis and, to some degree, enable machine learning methods to succeed in small-sample situations. We tested the performance in identifying gene targets and binding motifs over 88 yeast transcription factors. The ensemble method is able to integrate the orthogonal information from different weak learners and to perform consistently well for more transcription factors. It is useful in completing the transcription regulatory network for the entire genome.
Note the ensemble is easily extended to include more tools as well as more information in the
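
For readers unfamiliar with the conventional PWM scanning model that the abstract uses as its baseline, the following is a minimal sketch of the general technique, not the authors' implementation. The function names (`build_pwm`, `scan`, `best_hit`), the pseudocount of 1, the uniform background of 0.25 and the example sites are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length binding sites.

    Assumption: uniform background frequency and an additive pseudocount.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        # Count each base at this column, starting from the pseudocount.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of the observed frequency against the background.
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return (offset, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

def best_hit(sequence, pwm):
    """Return the (offset, score) pair with the highest log-odds score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])

# Hypothetical aligned sites and promoter sequence, for illustration only.
example_sites = ["TGACTC", "TGACTC", "TGACTA", "TGAGTC"]
example_pwm = build_pwm(example_sites)
example_offset, example_score = best_hit("AAAATGACTCAAAA", example_pwm)
```

In practice a score threshold would decide which windows count as putative binding sites; how that threshold is chosen is outside the scope of this sketch.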