Abstract: Background: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identification of the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for the docking. Our text-mining (TM) tool, which extracts binding site residues from the PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many extracted residues were not relevant to the docking.
Results: We present an extension of the TM tool, which utilizes natural language processing (NLP) for analyzing the context of the residue occurrence. The procedure was tested using generic and specialized dictionaries. The results showed that the keyword dictionaries designed for identification of protein interactions are not adequate for the TM prediction of the binding mode. However, our dictionary designed to distinguish keywords relevant to the protein binding sites led to considerable improvement in the TM performance. We investigated the utility of several methods of context analysis, based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of the mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation. The results showed significant improvement of the docking output when using the constraints generated by the advanced TM with NLP.
Conclusions: The basic TM procedure for extracting protein-protein binding site residues from PubMed abstracts was significantly advanced by deep parsing (NLP techniques for contextual analysis), which purged irrelevant residues from the initial pool of extracted residues. Benchmarking showed a substantial increase in the docking success rate when using the constraints generated by the advanced TM with NLP.
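
The idea of mining residue mentions and then checking their sentence context against a keyword dictionary can be illustrated with a minimal sketch. This is not the authors' tool: the regular expression, the `BINDING_KEYWORDS` set, and the function names are all hypothetical, and a simple per-sentence keyword test stands in here for the parse-tree and machine-learning analysis the abstract describes.

```python
import re

# Three-letter codes of the 20 standard amino acids.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Matches mentions such as "Arg45", "Tyr-102" or "Lys 33".
RESIDUE_RE = re.compile(rf"\b({AA3})[- ]?(\d+)\b")

# Hypothetical, tiny keyword dictionary (an assumption for illustration only;
# the dictionaries used in the paper are not reproduced here).
BINDING_KEYWORDS = {"bind", "binds", "binding", "interface",
                    "interacts", "interaction", "contact", "contacts"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mine_residues(abstract):
    """Return (residue name, residue number, has_binding_context) tuples.

    A mention is flagged as supported when its sentence contains at least
    one word from BINDING_KEYWORDS.
    """
    hits = []
    for sentence in split_sentences(abstract):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        supported = bool(words & BINDING_KEYWORDS)
        for name, number in RESIDUE_RE.findall(sentence):
            hits.append((name, int(number), supported))
    return hits


if __name__ == "__main__":
    text = ("Mutation of Arg45 and Tyr-102 abolished binding to the receptor. "
            "Lys 33 is located in the hydrophobic core.")
    for hit in mine_residues(text):
        print(hit)
```

In this toy input, the two residues in the first sentence share a sentence with the word "binding" and are flagged as supported, while the residue in the second sentence is not. A filter of this kind would keep only the supported mentions as candidate docking constraints.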

Automated assembly of molecular mechanisms at scale from text mining and curated databases

Petri Net Models for the Semi-Automatic Construction of Large Scale Biological Networks

Automating Knowledge-Driven Model Recommendation: Methodology, Evaluation, and Key Challenges

Inferring gene and protein interactions using PubMed citations and consensus Bayesian networks

Context-driven interaction retrieval and classification for modeling, curation, and reuse

InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions

Assembling biological boolean networks using manually curated databases and prediction algorithms

JAK/STAT signalling - an executable model assembled from molecule-centred modules demonstrating a module-oriented database concept for systems- and synthetic biology

Automated Biodesign Engineering by Abductive Meta-Interpretive Learning

Human-in-the-loop approach to identify functionally important residues of proteins from literature

Artificial Intelligence for Autonomous Molecular Design: A Perspective

Using residue interaction networks to understand protein function and evolution and to engineer new proteins

Natural language processing in text mining for structural modeling of protein complexes

An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models

Exploring mechanobiology network of bone and dental tissue based on Natural Language Processing

Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge

Discovering Patterns to Extract Protein-Protein Interactions from Full Texts

PathwayFinder: Bridging the Way Towards Automatic Pathway Extraction

A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals

Extracting Protein-Protein Interactions (PPIs) from Biomedical Literature using Attention-based Relational Context Information