Abstract: Motivation: With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists gather and make use of the knowledge encoded in text documents. To make organized and structured information available, automatic recognition of biomedical entity names is critical for information retrieval, information extraction and automated knowledge acquisition. Results: In this paper, we present a named entity recognition system for the biomedical domain, called PowerBioNE. To deal with the special naming conventions of the biomedical domain, we propose various evidential features: (1) word formation pattern; (2) morphological pattern, such as prefix and suffix; (3) part-of-speech; (4) head noun trigger; (5) special verb trigger and (6) name alias feature. All the features are integrated effectively and efficiently through a hidden Markov model (HMM) and an HMM-based named entity recognizer. In addition, a k-nearest neighbor (k-NN) algorithm is proposed to resolve the data sparseness problem in our system. Finally, we present a pattern-based post-processing step that automatically extracts rules from the training data to deal with the cascaded entity name phenomenon. To the best of our knowledge, PowerBioNE is the first system that deals with the cascaded entity name phenomenon. Evaluation shows that our system achieves F-measures of 66.6 and 62.2 on the 23 classes of GENIA V3.0 and V1.1, respectively. In particular, our system achieves an F-measure of 75.8 on the 'protein' class of GENIA V3.0. For comparison, our system outperforms the best published result by 7.8 on GENIA V1.1, without the help of any dictionaries.
Evaluation also shows that our HMM and the k-NN algorithm outperform other models, such as back-off HMM, linear interpolated HMM, support vector machines, C4.5, C4.5 rules and RIPPER, by effectively capturing the local context dependency and resolving the data sparseness problem. Moreover, evaluation on GENIA V3.0 shows that the post-processing for the cascaded entity name phenomenon improves the F-measure by 3.9. Finally, error analysis shows that about half of the errors are caused by the strict annotation scheme and the annotation inconsistency in the GENIA corpus. This suggests that our system achieves an acceptable F-measure of 83.6 on the 23 classes of GENIA V3.0 and in particular 86.2 on the 'protein' class, without the help of any dictionaries. We think that an F-measure of 90 on the 23 classes of GENIA V3.0, and in particular 92 on the 'protein' class, can be achieved by refining the annotation scheme of the GENIA corpus, for example with a flexible annotation scheme and consistent annotation, and by including a reasonable biomedical dictionary. Availability: A demo system is available at http://textmining.i2r.a-star.edu.sg/NLS/demo.htm . A technology license is available upon bilateral agreement.
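
The first two evidential features named in the abstract, word formation pattern and prefix/suffix morphology, can be illustrated with a minimal Python sketch. The pattern labels, the affix length of 3 and the function names below are assumptions made for illustration only; the abstract does not specify the feature inventory that PowerBioNE actually uses.

```python
# Illustrative sketch only: label names and affix length are assumptions,
# not the feature set used by PowerBioNE.

def word_formation(token: str) -> str:
    """Map a token to a coarse word formation label (hypothetical labels)."""
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    if has_digit and not has_alpha:
        return "Digits"        # e.g. "42"
    if has_digit and has_alpha:
        return "AlphaDigit"    # e.g. "IL-2"
    if "-" in token and has_alpha:
        return "Hyphenated"    # e.g. "NF-kappaB"
    if token.isupper():
        return "AllCaps"       # e.g. "DNA"
    if token[:1].isupper() and token[1:].islower():
        return "InitCap"       # e.g. "Protein"
    if token.islower():
        return "Lowercase"     # e.g. "kinase"
    return "MixedCase" if has_alpha else "Other"


def affixes(token: str, n: int = 3) -> dict:
    """Return the lowercased prefix and suffix of length n (n is an assumption)."""
    low = token.lower()
    return {"prefix": low[:n], "suffix": low[-n:]}


def token_features(token: str) -> dict:
    """Combine the two feature groups into one feature dictionary."""
    features = {"formation": word_formation(token)}
    features.update(affixes(token))
    return features


if __name__ == "__main__":
    for tok in ["IL-2", "NF-kappaB", "DNA", "kinase", "mRNA"]:
        print(tok, token_features(tok))
```

In a tagger of the kind the abstract describes, feature dictionaries like these would form part of each token's observation alongside the remaining features (part-of-speech, head noun trigger, special verb trigger and name alias), which this sketch does not cover.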

An automated domain-independent text reading, interpreting and extracting approach for reviewing the scientific literature

Evaluation of a prototype machine learning tool to semi-automate data extraction for systematic literature reviews

Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

A Novel Framework to Expedite Systematic Reviews by Automatically Building Information Extraction Training Corpora

Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System

An assisted literature review using machine learning models to identify and build a literature corpus

Recognizing Names in Biomedical Texts: a Machine Learning Approach

Natural Language Processing Applications in the Clinical Neurosciences: A Machine Learning Augmented Systematic Review

Natural Language Processing to Facilitate Breast Cancer Research and Management

Investigating Deep-Learning NLP for Automating the Extraction of Oncology Efficacy Endpoints from Scientific Literature

Enhanced Review Detection and Recognition: A Platform-Agnostic Approach with Application to Online Commerce

Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition

Assisted neuroscience knowledge extraction via machine learning applied to neural reconstruction metadata on NeuroMorpho.Org

Automating Systematic Literature Reviews with Natural Language Processing and Text Mining: a Systematic Literature Review

In‐depth evaluation of machine learning methods for semi‐automating article screening in a systematic review of mechanistic literature

Harnessing multiple LLMs for Information Retrieval: A case study on Deep Learning methodologies in Biodiversity publications

Advancements and Methodologies in Natural Language Processing and Machine Learning: A Comprehensive Review

Machine learning in medicine: a practical introduction to natural language processing

Review of Natural Language Processing in Pharmacology

A Hybrid Semi-Automated Workflow for Systematic and Literature Review Processes with Large Language Model Analysis

Automated Classification of Selected Data Elements from Free-text Diagnostic Reports for Clinical Research