BioMNER: A Dataset for Biomedical Method Entity Recognition

Chen Tang, Bohao Yang, Kun Zhao, Bo Lv, Chenghao Xiao, Frank Guerin, Chenghua Lin
2024-06-29
Abstract: Named entity recognition (NER) stands as a fundamental and pivotal task within the realm of Natural Language Processing. Particularly within the domain of Biomedical Method NER, this task presents notable challenges, stemming from the continual influx of domain-specific terminologies in scholarly literature. Current research in Biomedical Method (BioMethod) NER suffers from a scarcity of resources, primarily attributed to the intricate nature of methodological concepts, which necessitate a profound understanding for precise delineation. In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including the utilization of cutting-edge large-scale language models (LLMs) customised to our dataset. Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective assimilation of entity extraction patterns pertaining to biomedical methods. Remarkably, the approach, leveraging the modestly sized ALBERT model (only 11MB), in conjunction with conditional random fields (CRF), achieves state-of-the-art (SOTA) performance.
Computation and Language
What problem does this paper attempt to address?
This paper addresses Biomedical Method Named Entity Recognition (BioMethod NER): identifying mentions of methods in biomedical literature. The main challenge is the continual influx of new domain-specific terms, which makes methods hard to identify in text. Annotated resources are also scarce, largely because method concepts are intricate and annotators need a deep understanding of them to mark entities precisely.

To address this, the paper proposes a new dataset for BioMethod NER, built with an automated BioMethod entity recognition and information retrieval system that assists human annotation. This annotation assistance system uses rules and ChatGPT to identify candidate biomedical method entities, then retrieves relevant information from ChatGPT and Wikipedia for annotators to consult, improving both the quality and the efficiency of annotation.

The paper also evaluates a range of conventional and contemporary open-domain NER methods, including large language models (LLMs) customised to the dataset. It finds that large parameter counts inhibit, rather than help, the learning of entity extraction patterns for biomedical methods. A lightweight ALBERT model (only 11MB) combined with Conditional Random Fields (CRF) achieves state-of-the-art performance on this task.

The main contributions of the paper are:
1. Introducing an annotation assistance system that accelerates the identification of biomedical method entities.
2. Creating a high-quality dataset specifically for BioMethod NER.
3. Conducting extensive experiments with various machine learning techniques to analyze their capabilities on BioMethod NER.

Overall, the experiments reveal the limitations of large-scale language models for named entity recognition in a specialized domain and highlight the advantages of traditional methods such as CRF in certain situations.
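To make the role of the CRF layer more concrete, here is a minimal sketch of linear-chain CRF decoding (the Viterbi algorithm) over BIO tags. It is not the authors' implementation. The label set `B-METHOD`/`I-METHOD`/`O`, the example sentence, and every score are assumptions invented for illustration; in a model of the kind described, the per-token emission scores would come from an encoder such as ALBERT and the transition scores would be learned during training.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")
TAGS = ["O", "B-METHOD", "I-METHOD"]  # assumed BIO label set, not taken from the paper


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[Tuple[str, str], float],
                   start: Dict[str, float],
                   tags: List[str] = TAGS) -> List[str]:
    """Return the highest-scoring tag sequence under a linear-chain CRF."""
    if not emissions:
        return []
    # best[t][tag]: score of the best path that ends in `tag` at position t
    best = [{t: start.get(t, 0.0) + emissions[0][t] for t in tags}]
    back = []  # back[t-1][tag]: previous tag on that best path
    for scores_t in emissions[1:]:
        prev = best[-1]
        cur_scores, cur_back = {}, {}
        for cur in tags:
            p = max(tags, key=lambda q: prev[q] + transitions.get((q, cur), 0.0))
            cur_scores[cur] = prev[p] + transitions.get((p, cur), 0.0) + scores_t[cur]
            cur_back[cur] = p
        best.append(cur_scores)
        back.append(cur_back)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert BIO tags to (start, end) token spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


# Made-up example: all scores below are illustrative, not model output.
tokens = ["We", "used", "polymerase", "chain", "reaction", "."]
emissions = [
    {"O": 2.0, "B-METHOD": 0.1, "I-METHOD": 0.0},
    {"O": 2.0, "B-METHOD": 0.2, "I-METHOD": 0.0},
    {"O": 0.5, "B-METHOD": 1.5, "I-METHOD": 1.0},
    {"O": 1.0, "B-METHOD": 0.3, "I-METHOD": 0.9},
    {"O": 0.4, "B-METHOD": 0.2, "I-METHOD": 1.6},
    {"O": 2.0, "B-METHOD": 0.0, "I-METHOD": 0.0},
]
transitions = {
    ("O", "I-METHOD"): NEG_INF,        # an I- tag may not follow O
    ("B-METHOD", "I-METHOD"): 1.0,     # continuing an entity is rewarded
    ("I-METHOD", "I-METHOD"): 1.0,
}
start = {"I-METHOD": NEG_INF}          # a sentence may not begin with I-

predicted = viterbi_decode(emissions, transitions, start)
entities = [" ".join(tokens[s:e]) for s, e in bio_to_spans(predicted)]
print(predicted)
print(entities)
```

In this toy example, the token "chain" scores slightly higher for `O` than for `I-METHOD` on its own, so tagging each token independently would split the entity. The transition scores let the decoder keep "polymerase chain reaction" as a single span, which is the kind of sequence-level constraint a CRF layer contributes.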