Abstract: Background: Selecting relevant articles for curation, and linking those articles to the experimental techniques that confirm their findings, became one of the primary subjects of the recent BioCreative III contest. The contest's Protein-Protein Interaction (PPI) task consisted of two sub-tasks: the Article Classification Task (ACT) and the Interaction Method Task (IMT). ACT aimed to automatically select documents relevant to PPI curation, whereas the goal of IMT was to recognise, in full-text articles, the experimental methods used to identify the interactions.

Results: We proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and every candidate interaction method obtained promising results, with an F1 score of 64.49% on the task's development dataset. We also explored ways to combine this new approach with more conventional multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthews correlation coefficient and AUC iP/R; for ACT, our best classifier was ranked second as measured by AUC iP/R, and was also competitive according to other metrics.

Conclusions: Our novel approach, which converts the multi-class, multi-label classification problem into a binary classification problem, showed much promise in IMT. Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents, were among the main contributors to the performance.
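
As a rough illustration of the ideas summarised above, the following Python sketch shows one way a multi-label problem could be recast as binary decisions over (phrase, method) pairs, and how the output of such a system could be united with that of a document-level classifier. It is not the authors' implementation: the function names, the labelling rule for pairs, the "any positive pair" aggregation rule, and the toy data are all assumptions made for illustration only.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (text phrase, candidate interaction method)


def to_binary_instances(phrases: List[str], candidates: List[str],
                        gold: Set[str]) -> List[Tuple[Pair, int]]:
    """Pair every phrase with every candidate method.

    Assumption: a pair is labelled 1 when its method is a gold label
    of the document; the abstract does not specify the labelling rule.
    """
    return [((p, m), int(m in gold)) for p in phrases for m in candidates]


def methods_from_pairs(predictions: List[Tuple[Pair, int]]) -> Set[str]:
    """Assumed aggregation: keep a method if any of its pairs is positive."""
    return {m for (_, m), label in predictions if label == 1}


def precision_recall_f1(pred: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for one document."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Toy data (hypothetical, not taken from the BioCreative III datasets).
gold = {"two hybrid", "pull down"}
pair_output = {"two hybrid"}                          # pair-wise binary system
doc_output = {"pull down", "coimmunoprecipitation"}   # multi-label classifier
union_output = pair_output | doc_output               # union of both outputs

for name, pred in [("pair", pair_output), ("doc", doc_output),
                   ("union", union_output)]:
    print(name, precision_recall_f1(pred, gold))
```

In this toy case the union recovers both gold methods, so recall rises at some cost in precision, which mirrors the complementarity in recall noted in the Conclusions.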

Exploiting and Integrating Rich Features for Biological Literature Classification

Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature
