INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Large Language Models and Ensemble Learning

Pablo Romero, Lifeng Han, Goran Nenadic
2024-09-29
Abstract: Medication Extraction and Mining play an important role in healthcare NLP research due to their practical applications in hospital settings, such as their mapping into standard clinical knowledge bases (SNOMED-CT, BNF, etc.). In this work, we investigate state-of-the-art LLMs in text mining tasks on medications and their related attributes such as dosage, route, strength, and adverse effects. In addition, we explore different ensemble learning methods (Stack-Ensemble and Voting-Ensemble) to augment the performance of individual LLMs. Our ensemble learning results demonstrate better performance than the individually fine-tuned base models BERT, RoBERTa, RoBERTa-L, BioBERT, BioClinicalBERT, BioMedRoBERTa, ClinicalBERT, and PubMedBERT across general and specific domains. Finally, we build an entity linking function to map extracted medical terminologies into SNOMED-CT codes and British National Formulary (BNF) codes, which are further mapped to the Dictionary of Medicines and Devices (dm+d) and ICD. Our model's toolkit and desktop applications are publicly available at https://github.com/HECTA-UoM/ensemble-NER.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a problem in medical natural language processing (NLP): how to effectively extract drugs and their related attributes (such as dosage, route of administration, strength, and side effects) from unstructured text and automatically map them to standard clinical knowledge bases (such as SNOMED-CT and BNF). Specifically, the paper explores the following aspects:

1. **Drug Information Extraction and Mining**:
   - Extract drug names and their related attributes (dosage, route of administration, strength, side effects, frequency, duration, dosage form, reason, etc.) from medical texts.
   - Automatically map the extracted terms to standard clinical terminologies (such as SNOMED-CT and BNF) to achieve automated clinical coding.
2. **Model Performance Improvement**:
   - Study the performance of state-of-the-art large language models (LLMs) on drug information extraction tasks.
   - Explore different ensemble learning methods (such as STACK-ENSEMBLE and VOTING-ENSEMBLE) to improve on the performance of a single LLM.
3. **Application of Ensemble Learning**:
   - Improve the accuracy of named entity recognition (NER) by combining multiple pre-trained language models (such as BERT, RoBERTa, BioBERT, and ClinicalBERT).
   - Compare the effects of the two ensemble strategies (voting and stacking) and evaluate their performance on clinical texts.
4. **Entity Linking Function**:
   - Build an entity linking function that maps the extracted medical terms to SNOMED-CT codes and British National Formulary (BNF) codes, and further maps them to the Dictionary of Medicines and Devices (dm+d) and the International Classification of Diseases (ICD).
5. **User Tool Development**:
   - Develop desktop applications and web interfaces so that users can conveniently apply these models for drug information extraction and entity linking.
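The voting strategy listed under ensemble learning can be illustrated with a small sketch. This summary does not describe the paper's exact voting rule, so the code below assumes the simplest variant: each model emits one label per token, and the most frequent label wins. The function name `majority_vote`, the tie-breaking rule (fall back to the outside label `O`), and the label names in the usage note are illustrative assumptions, not taken from the paper or its toolkit.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Combine per-token labels from several NER models by majority vote.

    `predictions` holds one label sequence per model; all sequences must
    cover the same tokens in the same order.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted: List[str] = []
    for token_labels in zip(*predictions):
        # most_common() sorts labels by descending vote count.
        ranked = Counter(token_labels).most_common()
        top_label, top_count = ranked[0]
        # Assumption: a tie for first place is resolved with the fallback label.
        if len(ranked) > 1 and ranked[1][1] == top_count:
            voted.append(fallback)
        else:
            voted.append(top_label)
    return voted
```

For example, if three hypothetical models label a token as `B-Drug`, `B-Drug`, and `O`, the ensemble outputs `B-Drug`; if all three disagree, this sketch outputs `O`. A stacking ensemble would instead train a second-stage model on the base models' outputs, which is not shown here.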
### Formula Summary

The formulas in the paper are used to evaluate model performance and include the following metrics:

- **Precision**: \[ P = \frac{TP}{TP + FP} \] where \(TP\) is the number of true positives and \(FP\) is the number of false positives.
- **Recall**: \[ R = \frac{TP}{TP + FN} \] where \(FN\) is the number of false negatives.
- **F1 Score**: \[ F1 = 2 \times \frac{P \times R}{P + R} \]
- **Accuracy**: \[ Acc = \frac{TP + TN}{TP + TN + FP + FN} \] where \(TN\) is the number of true negatives.

Using these metrics, the paper evaluates the performance of different models and ensemble methods on drug information extraction tasks and demonstrates the effectiveness of ensemble learning.
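As a worked illustration of the four formulas, the following minimal sketch computes them from raw confusion counts. The `Counts` class and function names are hypothetical helpers introduced here, and returning 0.0 when a denominator is zero is an assumed convention, not something the paper specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one label (or pooled over all labels)."""
    tp: int
    fp: int
    fn: int
    tn: int = 0


def _safe_div(numerator: float, denominator: float) -> float:
    # Assumption: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def precision(c: Counts) -> float:
    return _safe_div(c.tp, c.tp + c.fp)


def recall(c: Counts) -> float:
    return _safe_div(c.tp, c.tp + c.fn)


def f1(c: Counts) -> float:
    p, r = precision(c), recall(c)
    return _safe_div(2 * p * r, p + r)


def accuracy(c: Counts) -> float:
    return _safe_div(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)
```

With made-up counts of 8 true positives, 2 false positives, 2 false negatives, and 8 true negatives, precision is 8/10, recall is 8/10, F1 is therefore also 0.8, and accuracy is 16/20 = 0.8.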