On the impact of data integration and edge enrichment in mining significant signals from biological networks

Sean West, Hesham Ali
DOI: https://doi.org/10.1145/2649387.2660846
2014-09-20
Abstract: The influx of high-throughput biotechnologies has resulted in considerable amounts of available and untapped data, useful for both interpretation and extrapolation. Because the noise-to-signal ratio in most biological databases is non-trivial, single-source analysis techniques may suffer from relatively high false-positive and false-negative rates. In addition, use of a single data source does not allow for the discovery of novel relationships that can only be derived from multiple sources. Recently, gene correlation networks have emerged to assist in the discovery of previously unknown genetic relationships and the identification of significant biological functions. Such networks provide a useful mechanism to model experimental results obtained from expression data and to capture a snapshot of the expression as well as the temporal changes in various experiments. In addition, Gene Ontology is often integrated with biological networks within the analysis process as a source of domain knowledge. In this project, we evaluate the use of Gene Ontology, not simply as an assessment tool, but as a basic component in building the correlation networks. We implemented a network integration algorithm that uses both gene expression data (experimental knowledge) and Gene Ontology data (domain knowledge) to build a biologically rich correlation model. Then, we analyzed the resulting networks for changes in topology and in biological significance. Our main hypothesis is that the integrated networks would reduce the harmful effects of outliers from imperfect data while maintaining the high concentration of network substructures that are likely to reveal novel, biologically significant relationships. In addition, using the concept of "guilt by association", we analyzed the clusters of the integrated networks and found a significant increase in enrichment scores relative to the original networks. We show, through motif and pathway analysis, that integrated networks tend to cluster with higher biological significance.
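The abstract does not spell out the integration algorithm, so the following is only a minimal sketch of one way expression correlation and ontology annotations could be combined into edge weights. Everything specific here is an assumption made for illustration and is not taken from the paper: the Jaccard overlap of annotation sets as the domain-knowledge score, the linear blend controlled by `alpha`, the `cutoff` threshold, and the toy gene names and term labels.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def jaccard(a, b):
    """Overlap of two annotation sets (assumed stand-in for ontology similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def integrate(expression, annotations, alpha=0.5, cutoff=0.6):
    """Return {(gene1, gene2): weight} for gene pairs whose blended score passes cutoff.

    alpha weights the expression evidence and (1 - alpha) the annotation
    evidence; both parameters are hypothetical, not values from the paper.
    """
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = abs(pearson(expression[g1], expression[g2]))
        s = jaccard(annotations.get(g1, set()), annotations.get(g2, set()))
        weight = alpha * r + (1 - alpha) * s
        if weight >= cutoff:
            edges[(g1, g2)] = round(weight, 3)
    return edges


# Toy data: hypothetical genes with made-up expression profiles and term labels.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
annotations = {
    "geneA": {"T1", "T2"},
    "geneB": {"T1", "T2"},
    "geneC": {"T3"},
}

integrated = integrate(expression, annotations)
expression_only = integrate(expression, annotations, alpha=1.0)
```

In this toy setup, setting `alpha=1.0` reduces the model to a plain correlation network, which keeps every strongly correlated pair. With the blended score, pairs that correlate but share no annotations fall below the cutoff, which is one possible reading of how domain knowledge might dampen spurious edges.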