Abstract:Background: Named entity recognition (NER) models are essential for extracting structured information from unstructured medical texts by identifying entities such as diseases, treatments, and conditions, enhancing clinical decision-making and research. Innovations in machine learning, particularly those involving Bidirectional Encoder Representations From Transformers (BERT)–based deep learning and large language models, have significantly advanced NER capabilities. However, their performance varies across medical datasets due to the complexity and diversity of medical terminology. Previous studies have often focused on overall performance, neglecting specific challenges in medical contexts and the impact of macrofactors like lexical composition on prediction accuracy. These gaps hinder the development of optimized NER models for medical applications. Objective: This study aims to meticulously evaluate the performance of various NER models in the context of medical text analysis, focusing on how complex medical terminology affects entity recognition accuracy. Additionally, we explored the influence of macrofactors on model performance, seeking to provide insights for refining NER models and enhancing their reliability for medical applications. Methods: This study comprehensively evaluated 7 NER models—hidden Markov models, conditional random fields, BERT for Biomedical Text Mining, Big Transformer Models for Efficient Long-Sequence Attention, Decoding-enhanced BERT with Disentangled Attention, Robustly Optimized BERT Pretraining Approach, and Gemma—across 3 medical datasets: Revised Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), BioCreative V CDR, and Anatomical Entity Mention (AnatEM). The evaluation focused on prediction accuracy, resource use (eg, central processing unit and graphics processing unit use), and the impact of fine-tuning hyperparameters. The macrofactors affecting model performance were also screened using the multilevel factor elimination algorithm. Results: The fine-tuned BERT for Biomedical Text Mining, with balanced resource use, generally achieved the highest prediction accuracy across the Revised JNLPBA and AnatEM datasets, with microaverage (AVG_MICRO) scores of 0.932 and 0.8494, respectively, highlighting its superior proficiency in identifying medical entities. Gemma, fine-tuned using the low-rank adaptation technique, achieved the highest accuracy on the BioCreative V CDR dataset with an AVG_MICRO score of 0.9962 but exhibited variability across the other datasets (AVG_MICRO scores of 0.9088 on the Revised JNLPBA and 0.8029 on AnatEM), indicating a need for further optimization. In addition, our analysis revealed that 2 macrofactors, entity phrase length and the number of entity words in each entity phrase, significantly influenced model performance. Conclusions: This study highlights the essential role of NER models in medical informatics, emphasizing the imperative for model optimization via precise data targeting and fine-tuning. The insights from this study will notably improve clinical decision-making and facilitate the creation of more sophisticated and effective medical NER models.

Utilizing Large Language Models for Named Entity Recognition in Traditional Chinese Medicine against COVID-19 Literature: Comparative Study

TCMChat: A Generative Large Language Model for Traditional Chinese Medicine

Comparative Analysis of Large Language Models in Chinese Medical Named Entity Recognition

Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge

Large Language Models in Traditional Chinese Medicine: A Scoping Review

Assessing the performance of large language models (LLMs) in answering medical questions regarding breast cancer in the Chinese context

Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification

A Multigranularity Text Driven Named Entity Recognition CGAN Model for Traditional Chinese Medicine Literatures

P717 Evaluating the performance of Large Language Models in responding to patients' health queries: A comparative analysis with medical experts

Large Language Models Struggle in Token-Level Clinical Named Entity Recognition

Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

Large Language Model in Medical Information Extraction from Titles and Abstracts with Prompt Engineering Strategies: A Comparative Study of GPT-3.5 and GPT-4

TCM-GPT: Efficient Pre-training of Large Language Models for Domain Adaptation in Traditional Chinese Medicine

Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis

How far is Language Model from 100% Few-shot Named Entity Recognition in Medical Domain

Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study

Lingdan: enhancing encoding of traditional Chinese medicine knowledge for clinical reasoning tasks with large language models

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

Relation Extraction Using Large Language Models: A Case Study on Acupuncture Point Locations

Named Entity Recognition in Chinese Medical Literature Using Pretraining Models