Classification Systems for Bacterial Protein-Protein Interaction Document Retrieval

Hongfang Liu, Manabu Torii, Guixian Xu, Johannes Goll
DOI: https://doi.org/10.4018/jcmam.2010072003
2010-01-01
International Journal of Computational Models and Algorithms in Medicine
Abstract: Protein-protein interaction (PPI) networks are essential to understanding the fundamental processes governing cell biology. Studying PPI networks has recently become possible thanks to advances in high-throughput experimental genomics and proteomics technologies. Many interactions from such high-throughput studies, and most interactions from small-scale studies, are reported only in the scientific literature and are therefore not available in a readily analyzable format. This has led to manual curation initiatives such as the International Molecular Exchange Consortium (IMEx). Manual curation of PPI knowledge can be accelerated by text mining systems that retrieve PPI-relevant articles (article retrieval) and extract PPI-relevant knowledge (information extraction). In this article, the authors focus on article retrieval and define the task as binary classification, where PPI-relevant articles are positives and all others are negatives. Building such a classifier requires an annotated corpus. Obtaining one manually is very expensive, but a noisy and imbalanced annotated corpus can be obtained automatically: positive documents can be retrieved from existing PPI knowledge bases, and a large number of unlabeled documents (most of them negatives) can be retrieved from PubMed. The authors compared the performance of several machine learning algorithms while varying the ratio of positives to unlabeled documents and the number of features used.
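
The experimental setup described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy sentences are invented, the multinomial naive Bayes classifier stands in for the unnamed "several machine learning algorithms", the document-frequency-difference score is a placeholder feature selector, and the helper names (build_training_set, select_features, train_nb, predict) are hypothetical. It shows only the general shape of the task: treat unlabeled documents as noisy negatives, then vary the unlabeled-to-positive ratio and the number of features.

```python
import math
import random
from collections import Counter

# Invented toy data (assumption): positives stand in for documents cited by a
# PPI knowledge base, unlabeled for documents sampled from PubMed.
POSITIVES = [
    "protein interacts with protein in two-hybrid assay",
    "complex binds receptor and interacts with kinase",
    "pull-down assay shows kinase binds protein",
]
UNLABELED = [
    "genome sequence of a soil strain",
    "antibiotic resistance in clinical isolates",
    "growth rate of the strain in culture",
    "genome annotation of clinical isolates",
    "sequence analysis of resistance genes",
    "culture conditions and growth of isolates",
]

def tokenize(text):
    return text.lower().split()

def build_training_set(positives, unlabeled, ratio, seed=0):
    """Sample `ratio` unlabeled documents per positive (capped at what is available)."""
    rng = random.Random(seed)
    n = min(len(unlabeled), ratio * len(positives))
    return positives, rng.sample(unlabeled, n)

def select_features(pos_docs, unl_docs, k):
    """Keep the k tokens whose document frequency differs most between the two sets."""
    pos_df = Counter(t for d in pos_docs for t in set(tokenize(d)))
    unl_df = Counter(t for d in unl_docs for t in set(tokenize(d)))
    def score(t):
        return abs(pos_df[t] / len(pos_docs) - unl_df[t] / len(unl_docs))
    vocab = set(pos_df) | set(unl_df)
    # Ties are broken alphabetically so the selection is deterministic.
    return set(sorted(vocab, key=lambda t: (-score(t), t))[:k])

def train_nb(pos_docs, unl_docs, features):
    """Multinomial naive Bayes with add-one smoothing; unlabeled docs act as negatives."""
    total = len(pos_docs) + len(unl_docs)
    model = {}
    for label, docs in (("pos", pos_docs), ("neg", unl_docs)):
        counts = Counter(t for d in docs for t in tokenize(d) if t in features)
        denom = sum(counts.values()) + len(features)
        log_prior = math.log(len(docs) / total)
        log_lik = {t: math.log((counts[t] + 1) / denom) for t in features}
        model[label] = (log_prior, log_lik)
    return model

def predict(model, text):
    """Return the label with the highest log-posterior; tokens outside the feature set are ignored."""
    scores = {
        label: prior + sum(log_lik[t] for t in tokenize(text) if t in log_lik)
        for label, (prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Sweep the two factors named in the abstract: class ratio and feature count.
for ratio in (1, 2):
    for k in (4, 8):
        pos, unl = build_training_set(POSITIVES, UNLABELED, ratio)
        model = train_nb(pos, unl, select_features(pos, unl, k))
        print(ratio, k, predict(model, "kinase binds protein"))
```

A real comparison would replace the toy sentences with the automatically assembled corpus and score each configuration on held-out documents; the sketch does neither.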