Utilizing Large Language Models for Named Entity Recognition in Traditional Chinese Medicine against COVID-19 Literature: Comparative Study

Xu Tong, Nina Smirnova, Sharmila Upadhyaya, Ran Yu, Jack H. Culbert, Chao Sun, Wolfgang Otto, Philipp Mayr
2024-08-24
Abstract: Objective: To explore and compare the performance of ChatGPT and other state-of-the-art LLMs on domain-specific NER tasks covering different entity types and domains in TCM against COVID-19 literature. Methods: We established a dataset of 389 articles on TCM against COVID-19, and manually annotated 48 of them with 6 types of entities belonging to 3 domains as the ground truth, against which the NER performance of LLMs can be assessed. We then performed NER tasks for the 6 entity types using ChatGPT (GPT-3.5 and GPT-4) and 4 state-of-the-art BERT-based question-answering (QA) models (RoBERTa, MiniLM, PubMedBERT and SciBERT) without prior training on the specific task. A domain fine-tuned model (GSAP-NER) was also applied for a comprehensive comparison. Results: The overall performance of LLMs differed significantly between exact match and fuzzy match. In fuzzy match, ChatGPT surpassed BERT-based QA models in 5 out of 6 tasks, while in exact match, BERT-based QA models outperformed ChatGPT in 5 out of 6 tasks but with a smaller F-1 difference. GPT-4 showed a significant advantage over other models in fuzzy match, especially on the entity types TCM formula and Chinese patent drug (TFD) and ingredient (IG). Although GPT-4 outperformed BERT-based models on the entity types herb, target, and research method (RM), none of the F-1 scores exceeded 0.5. GSAP-NER outperformed GPT-4 in terms of F-1 by a slight margin on RM. ChatGPT achieved considerably higher recall than precision, particularly in fuzzy match. Conclusions: The NER performance of LLMs is highly dependent on the entity type, and their performance varies across application scenarios. ChatGPT could be a good choice for scenarios where high recall is favored. However, for knowledge acquisition in rigorous scenarios, neither ChatGPT nor BERT-based QA models are off-the-shelf tools for professional practitioners.
Computation and Language, Information Retrieval
What problem does this paper attempt to address?
This paper attempts to address the issue of evaluating and comparing the performance of large language models (LLMs) in named entity recognition (NER) tasks within Traditional Chinese Medicine (TCM) literature related to combating COVID-19. Specifically:

- **Research Background**: TCM has shown effectiveness in fighting COVID-19, but the related literature contains a large amount of cross-domain entity information, requiring effective methods to automatically extract this information.
- **Research Objective**: To explore and compare the performance of ChatGPT and other state-of-the-art large language models in NER tasks within TCM literature related to combating COVID-19, covering different types of entities and domains.
- **Dataset Construction**: A dataset containing 389 articles on TCM combating COVID-19 was established, with 48 articles manually annotated as benchmark data.
- **Model Comparison**: NER tasks were performed using ChatGPT (GPT-3.5 and GPT-4) and four BERT-based question-answering models (RoBERTa, MiniLM, PubMedBERT, and SciBERT), along with a domain fine-tuned model (GSAP-NER) for comprehensive comparison.

Through the above research, the paper aims to evaluate the applicability and performance differences of different models in NER tasks within a specific domain, providing valuable references for future research on similar health crises.
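
The abstract reports precision, recall, and F-1 under both exact match and fuzzy match, but it does not define the fuzzy criterion. The following is a minimal Python sketch of how such scores could be computed, not the authors' evaluation code. The fuzzy rule (substring containment or a character-similarity ratio of at least 0.8), the function names `normalize`, `is_match`, and `score`, and the sample entities are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def is_match(pred, gold, fuzzy, threshold=0.8):
    """Compare two normalized entity strings.

    Exact mode requires identical strings. Fuzzy mode is an ASSUMED rule:
    one string contains the other, or their similarity ratio >= threshold.
    """
    if not fuzzy:
        return pred == gold
    if pred in gold or gold in pred:
        return True
    return SequenceMatcher(None, pred, gold).ratio() >= threshold


def score(predicted, gold, fuzzy=False):
    """Return (precision, recall, f1) for one entity type in one document."""
    # Normalize, drop empty strings, and deduplicate while keeping order.
    preds = list(dict.fromkeys(p for p in map(normalize, predicted) if p))
    golds = list(dict.fromkeys(g for g in map(normalize, gold) if g))

    # A prediction is correct if it matches any gold entity;
    # a gold entity is found if any prediction matches it.
    correct_preds = sum(any(is_match(p, g, fuzzy) for g in golds) for p in preds)
    found_golds = sum(any(is_match(p, g, fuzzy) for p in preds) for g in golds)

    precision = correct_preds / len(preds) if preds else 0.0
    recall = found_golds / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical annotations and model output, for illustration only.
    gold = ["Lianhua Qingwen capsule", "quercetin", "ACE2"]
    predicted = ["Lianhua Qingwen", "quercetin", "IL-6", "ACE2 receptor"]
    print("exact:", score(predicted, gold, fuzzy=False))
    print("fuzzy:", score(predicted, gold, fuzzy=True))
```

Under a rule like this, a model that returns partial or extended spans scores much higher in fuzzy match than in exact match, which is consistent with the gap the abstract describes for ChatGPT.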