Abstract:

Background: Deciphering physical protein-protein interactions is fundamental to elucidating both the functions of proteins and biological processes. High-throughput experimental technologies such as yeast two-hybrid screening have produced an explosion of interaction data. Because manual curation is costly in both time and money, there is an urgent need for text-mining tools that facilitate the extraction of such information. The BioCreative (Critical Assessment of Information Extraction systems in Biology) challenge evaluation provided common standards and shared evaluation criteria to enable comparisons among different approaches.

Results: In the benchmark evaluation of BioCreative 2006, all of our results ranked in the top three. In the task of filtering out articles irrelevant to physical protein interactions, our method achieved a precision of 75.07%, a recall of 81.07%, and an AUC (area under the receiver operating characteristic curve) of 0.847. In the task of identifying protein mentions and normalizing them to molecule identifiers, our method was competitive among the submitted runs, with a precision of 34.83%, a recall of 24.10%, and an F1 score of 28.5%. In extracting protein interaction pairs, our profile-based method was competitive on the SwissProt-only subset (precision = 36.95%, recall = 32.68%, and F1 score = 30.40%) and on the entire dataset (30.96%, 29.35%, and 26.20%, respectively). From the biologist's point of view, however, these findings are far from satisfactory. The error analysis presented in this report provides insight into how performance could be improved: about three-quarters of false negatives were due to protein normalization problems (532/698), and about one-quarter were due to this system's problems with correctly extracting interactions.

Conclusion: We present a text-mining framework to extract physical protein-protein interactions from the literature. Three key issues are addressed: filtering irrelevant articles, identifying protein names and normalizing them to molecule identifiers, and extracting protein-protein interactions. Our system was among the top three performers in the benchmark evaluation of BioCreative 2006. The tool will be helpful for manual interaction curation and can greatly facilitate the process of extracting protein-protein interactions.
