Protein-Protein Interactions Extraction From Biomedical Literatures
Hongfei Lin,Zhihao Yang,Yanpeng Li
DOI: https://doi.org/10.5772/13552
2011-01-01
Abstract:Protein-protein interactions (PPI) play a key role in various aspects of the structural and functional organization of the cell. Knowledge about them unveils the molecular mechanisms of biological processes. A number of databases such as MINT (Zanzoni et al., 2002), BIND (Bader et al., 2003), and DIP (Xenarios et al., 2002) have been created to store protein interaction information in structured and standard formats. However, the amount of biomedical literature regarding protein interactions is increasing rapidly and it is difficult for interaction database curators to detect and curate protein interaction information manually. Thus, most of the protein interaction information remains hidden in the text of the papers in the literature. Therefore, automatic extraction of protein interaction information from biomedical literature has become an important research area. Existing PPI works can be roughly divided into three categories: Manual pattern engineering approaches, Grammar engineering approaches and Machine learning approaches. Manual pattern engineering approaches define a set of rules for possible textual relationships, called patterns, which encode similar structures in expressing relationships. The SUISEKI system uses regular expressions, with probabilities that reflect the experimental accuracy of each pattern to extract interactions into predefined frame structures (Blaschke & Valencia, 2002). Ono et al. manually defined a set of rules based on syntactic features to preprocess complex sentences, with negation structures considered as well (Ono et al., 2001). The BioRAT system uses manually engineered templates that combine lexical and semantic information to identify protein interactions (Corney et al., 2004). Such manual pattern engineering approaches for information extraction are very hard to scale up to large document collections since they require labor-intensive and skilldependent pattern engineering. Grammar engineering approaches use manually generated specialized grammar rules that perform a deep parse of the sentences. Sekimizu et al. used shallow parser, EngCG, to generate syntactic, morphological, and boundary tags (Sekimizu et al., 1998). Based on the tagging results, subjects and objects were recognized for the most frequently used verbs. Fundel et al. proposed RelEx based on the dependency parse trees to extract relations (Fundel et al., 2007). Machine learning techniques for extracting protein interaction information have gained interest in the recent years. In most recent work on machine learning for PPI extraction, the PPI extraction task is casted as learning a decision function that determines for each