Abstract: The emergence of large language models (LLMs) has provided robust support for application tasks across various domains, such as named entity recognition (NER) in the general domain. However, owing to the particularities of the medical domain, research on understanding and improving the effectiveness of LLMs on biomedical named entity recognition (BNER) remains relatively limited, especially for Chinese text. In this study, we extensively evaluate several typical LLMs, including ChatGLM2-6B, GLM-130B, GPT-3.5, and GPT-4, on the Chinese BNER task using a real-world Chinese electronic medical record (EMR) dataset and a public dataset. The experimental results show that LLMs with zero-shot and few-shot prompt designs achieve promising yet limited performance on Chinese BNER. More importantly, instruction fine-tuning significantly enhances the performance of LLMs. The fine-tuned offline ChatGLM2-6B surpasses the task-specific model BiLSTM+CRF (BC) on the real-world dataset. The best fine-tuned model, GPT-3.5, outperforms all other LLMs on the publicly available CCKS2017 dataset and even surpasses half of the baselines; however, it still falls short of the state-of-the-art task-specific model, the Dictionary-guided Attention Network (DGAN). To our knowledge, this study is the first attempt to evaluate the performance of LLMs on Chinese BNER tasks, and it highlights the prospective and transformative implications of applying LLMs to them. Furthermore, we summarize our findings into a set of actionable guidelines for future researchers on how to effectively adapt LLMs into experts on specific tasks.

Comparative Analysis of Listwise Reranking with Large Language Models in Limited-Resource Language Contexts

Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages

Self-Calibrated Listwise Reranking with Large Language Models

Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages

Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models

Zero-Shot Listwise Document Reranking with a Large Language Model

Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting

Make Large Language Model a Better Ranker

Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models

Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models

Large Language Models Are Zero-Shot Rankers for Recommender Systems

Ranking Large Language Models without Ground Truth

How good are Large Language Models on African Languages?

EcoRank: Budget-Constrained Text Re-ranking Using Large Language Models

Ranked List Truncation for Large Language Model-based Re-Ranking

Comparative Analysis of Large Language Models in Chinese Medical Named Entity Recognition

Do Large Language Models Speak All Languages Equally? A Comparative Study in Low-Resource Settings

Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents

Instruction Distillation Makes Large Language Models Efficient Zero-shot Rankers

Sliding Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models

A Two-Stage Adaptation of Large Language Models for Text Ranking