VANER: Leveraging Large Language Model for Versatile and Adaptive Biomedical Named Entity Recognition

Junyi Bian, Weiqi Zhai, Xiaodi Huang, Jiaxuan Zheng, Shanfeng Zhu
2024-04-27
Abstract: The prevalent solution for BioNER couples representation learning techniques with sequence labeling. However, such methods are inherently task-specific, generalize poorly, and often require a dedicated model for each dataset. To leverage the versatile capabilities of recent large language models (LLMs), several efforts have explored generative approaches to entity extraction. Yet these approaches often fall short of the effectiveness of previous sequence labeling approaches. In this paper, we use the open-source LLM LLaMA2 as the backbone model and design specific instructions to distinguish between different types of entities and datasets. By combining the LLM's understanding of instructions with sequence labeling techniques, we use a mix of datasets to train a model capable of extracting various types of entities. Given that the backbone LLM lacks specialized medical knowledge, we also integrate external entity knowledge bases and employ instruction tuning to compel the model to densely recognize carefully curated entities. Our model VANER, trained with a small fraction of parameters, significantly outperforms previous LLM-based models and, for the first time as an LLM-based model, surpasses the majority of conventional state-of-the-art BioNER systems, achieving the highest F1 scores across three datasets.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address several key issues in the task of Biomedical Named Entity Recognition (BioNER):

1. **Task Specificity and Generalization Ability**:
   - Current BioNER methods mostly rely on representation learning techniques and sequence labeling, which are usually designed for specific tasks and lack good generalization ability. Each dataset often requires a dedicated model, leading to model redundancy and resource waste.
2. **Effectiveness of Generative Models**:
   - Although some studies attempt to use generative pre-trained models (such as GPT) for entity extraction, these methods usually do not perform as well as traditional sequence labeling methods.
3. **Lack of Domain Knowledge**:
   - Large language models (LLMs), although performing well in natural language processing tasks, lack expertise in the biomedical field. This limits their performance in BioNER tasks.
4. **Challenges of Multi-Dataset Training**:
   - There are multiple datasets in the biomedical field with inconsistent annotation standards. Directly concatenating these datasets leads to annotation inconsistencies, affecting model performance.

### Solutions

To address the above issues, the paper proposes the VANER model, which has the following main features:

1. **Utilizing Large Language Models (LLMs)**:
   - Using the open-source LLaMA2 as the backbone model and designing specific instructions to distinguish different types of entities and datasets. By combining the understanding ability of LLMs with sequence labeling techniques, the model can be trained on various datasets to extract different types of entities.
2. **Dense Biomedical Entity Recognition (DBR)**:
   - To compensate for the lack of knowledge in the biomedical field by LLMs, an external entity knowledge base (such as UMLS) is introduced, and instruction tuning is used to enable the model to densely recognize well-curated entities. This approach not only enhances the model's knowledge understanding ability but also improves the model's convergence speed and performance.
3. **Multi-Dataset Instruction Tuning**:
   - By performing instruction tuning on multiple biomedical NER datasets, the model can better adapt to different annotation standards, thereby improving overall performance.
4. **Resource Efficiency**:
   - This method requires only a single 4090 GPU for training and inference, making it highly resource-efficient.

### Experimental Results

- **Performance Improvement**: VANER achieves state-of-the-art performance on multiple datasets, particularly excelling on the BC4CHEMD, BC5CDR-chem, and Linnaeus datasets.
- **Domain Adaptability**: VANER demonstrates strong domain adaptability, performing well on the unseen CRAFT dataset.
- **Resource Efficiency**: Compared to traditional methods, VANER is more resource-efficient, requiring only a single 4090 GPU for training and inference.

### Summary

By combining the versatility of large language models with sequence labeling techniques, VANER effectively addresses the challenges of task specificity, generalization ability, lack of domain knowledge, and multi-dataset training in BioNER tasks, significantly improving model performance and resource efficiency.
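
### Illustrative Sketch

To make the data side of this idea concrete, the following is a minimal sketch of how instruction-prefixed sequence labeling examples might be assembled. It is not the authors' code. The instruction wording, the `IGNORE` placeholder label for instruction tokens, the whitespace tokenization, and the toy lexicon standing in for a knowledge base such as UMLS are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

IGNORE = "IGNORE"  # assumed placeholder: instruction tokens carry no entity label


def build_instruction(dataset: str, entity_type: str) -> str:
    """Hypothetical instruction template naming the dataset and entity type."""
    return f"Dataset: {dataset}. Extract all {entity_type} entities from the following text."


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Convert token-level spans [start, end) into B/I/O tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def dictionary_spans(tokens: List[str], lexicon: Set[Tuple[str, ...]]) -> List[Tuple[int, int]]:
    """Greedy longest-match lookup against a toy lexicon (stand-in for a real KB)."""
    max_len = max((len(entry) for entry in lexicon), default=0)
    spans: List[Tuple[int, int]] = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so multi-word entities win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            spans.append((i, i + matched))
            i += matched
        else:
            i += 1
    return spans


def make_example(dataset: str, entity_type: str, tokens: List[str],
                 spans: List[Tuple[int, int]]) -> Dict[str, List[str]]:
    """Prefix the sentence with its instruction and align labels token by token."""
    instruction = build_instruction(dataset, entity_type).split()
    return {
        "tokens": instruction + tokens,
        "labels": [IGNORE] * len(instruction) + bio_tags(tokens, spans),
    }


sentence = ["Aspirin", "reduces", "heart", "attack", "risk"]

# Dataset-specific example: only the chemical mention is annotated.
gold = make_example("BC5CDR-chem", "chemical", sentence, [(0, 1)])

# Dense-recognition example: every lexicon hit is labeled, whatever its type.
toy_lexicon = {("aspirin",), ("heart", "attack")}
dense = make_example("KB", "biomedical", sentence, dictionary_spans(sentence, toy_lexicon))
```

Because the same sentence receives different labels under different instructions, a single model can in principle be trained on a mix of datasets without merging their annotation schemes. In a real setup the tokens would come from the LLM's subword tokenizer and the labels would be mapped to integer ids.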