Predicting implicit associated cancer genes from OMIM and MEDLINE by a new probabilistic model

Shanfeng Zhu, Yasushi Okuno, Gozoh Tsujimoto, Hiroshi Mamitsuka
DOI: https://doi.org/10.1186/1752-0509-1-S1-P16
2007-01-01
BMC Systems Biology
Abstract:

Background
Discovering cancer-associated genes can facilitate the understanding of tumour pathogenesis, medical diagnosis and the treatment of patients. Here we mined OMIM and MEDLINE to discover implicitly associated cancer genes by applying a new probabilistic model, the mixture aspect model (MAM) [1], to cancer-gene co-occurrence data in OMIM and MEDLINE. Cross-validation experiments showed that the accuracy of predicting associated cancer genes improved when gene-gene co-occurrence pairs from MEDLINE were incorporated alongside the cancer-gene co-occurrence pairs from OMIM. Furthermore, some implicitly associated cancer genes were predicted and given a preliminary analysis. The detailed results are available online at http://www.bic.kyoto-u.ac.jp/pathway/zhusf/CancerInformatics/Supplemental2006.html for reference by interested researchers and for further validation by biologists.

Materials and methods
We extracted cancer-gene and cancer-cancer co-occurrence pairs from OMIM, a human-curated knowledge base on human genes and inherited diseases. The software tool CGMIM was used to mine the description sections of OMIM to obtain cancers and their associated genes [2]. This software maps genetic disorders to 21 different types of cancer. To avoid the difficulty of recognizing gene names in free text, we used the human-curated database Entrez Gene to obtain a subset of high-quality MEDLINE records, from which we extracted gene-gene co-occurrence data. We previously proposed MAM to mine implicit chemical compound-gene relations by integrating three types of co-occurrence data (compound-compound, gene-gene and compound-gene) in the literature [1]. The main advantage of MAM is its ability to integrate different types of co-occurrence data from heterogeneous data sources. MAM was first estimated with an EM algorithm to fit the existing cancer-gene co-occurrence data, and was then used to predict the likelihood of an association for an unobserved cancer-gene pair. See Table 1.
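The estimate-then-predict procedure described above can be illustrated with a minimal sketch. It is not the authors' MAM: it fits a plain single-table aspect model, P(c, g) = sum over z of P(z) P(c|z) P(g|z), to cancer-gene counts only, whereas MAM as described also integrates gene-gene co-occurrence data from MEDLINE. The function names, the number of latent aspects and the toy counts are all hypothetical and chosen only for illustration.

```python
import random

def fit_aspect_model(counts, n_aspects=2, n_iter=200, seed=0):
    """EM for P(c, g) = sum_z P(z) P(c|z) P(g|z) on a cancer-by-gene count table.

    Assumes `counts` is a rectangular list of lists with at least one
    non-zero entry. This is a generic aspect model, not the MAM itself.
    """
    rng = random.Random(seed)
    n_c, n_g = len(counts), len(counts[0])

    def random_dist(n):
        # Strictly positive random weights, normalised to sum to 1.
        w = [rng.random() + 0.1 for _ in range(n)]
        s = sum(w)
        return [x / s for x in w]

    p_z = [1.0 / n_aspects] * n_aspects
    p_c_z = [random_dist(n_c) for _ in range(n_aspects)]
    p_g_z = [random_dist(n_g) for _ in range(n_aspects)]

    for _ in range(n_iter):
        # Accumulators for expected counts per latent aspect.
        z_mass = [0.0] * n_aspects
        c_mass = [[0.0] * n_c for _ in range(n_aspects)]
        g_mass = [[0.0] * n_g for _ in range(n_aspects)]
        for c in range(n_c):
            for g in range(n_g):
                if counts[c][g] == 0:
                    continue
                # E-step: posterior P(z | c, g) for each observed pair.
                joint = [p_z[z] * p_c_z[z][c] * p_g_z[z][g]
                         for z in range(n_aspects)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_aspects):
                    r = counts[c][g] * joint[z] / total
                    z_mass[z] += r
                    c_mass[z][c] += r
                    g_mass[z][g] += r
        # M-step: re-estimate parameters from the expected counts.
        grand = sum(z_mass)
        for z in range(n_aspects):
            if z_mass[z] > 0.0:
                p_c_z[z] = [m / z_mass[z] for m in c_mass[z]]
                p_g_z[z] = [m / z_mass[z] for m in g_mass[z]]
            p_z[z] = z_mass[z] / grand
    return p_z, p_c_z, p_g_z

def pair_probability(params, c, g):
    """Model probability of the (cancer c, gene g) pair, observed or not."""
    p_z, p_c_z, p_g_z = params
    return sum(p_z[z] * p_c_z[z][c] * p_g_z[z][g] for z in range(len(p_z)))

# Hypothetical toy data: rows are cancers, columns are genes.
cancers = ["cancer_A", "cancer_B", "cancer_C"]
genes = ["gene_1", "gene_2", "gene_3", "gene_4"]
counts = [
    [3, 2, 0, 0],
    [2, 0, 1, 0],
    [0, 0, 4, 2],
]

params = fit_aspect_model(counts)

# Score only the pairs that never co-occurred, then rank them.
unobserved_scores = {
    (cancers[c], genes[g]): pair_probability(params, c, g)
    for c in range(len(cancers))
    for g in range(len(genes))
    if counts[c][g] == 0
}
ranked = sorted(unobserved_scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this sketch, `ranked` orders the unobserved cancer-gene pairs by their fitted probability, which mirrors the idea of proposing implicit associations for later validation. EM converges to a local optimum, so the ranking can depend on the seed and on the chosen number of aspects.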
from BioSysBio 2007: Systems Biology, Bioinformatics and Synthetic Biology Manchester, UK. 11–13 January 2007