Efficient Biomedical Entity Linking: Clinical Text Standardization with Low-Resource Techniques

Akshit Achara, Sanand Sasidharan, Gagan N
2024-05-24
Abstract: Clinical text is rich in information, with mentions of treatment, medication and anatomy among many other clinical terms. Multiple terms can refer to the same core concept, which can be referred to as a clinical entity. Ontologies like the Unified Medical Language System (UMLS) are developed and maintained to store millions of clinical entities, including their definitions, relations and other corresponding information. These ontologies are used to standardize clinical text by normalizing the varying surface forms of a clinical term through biomedical entity linking. With the introduction of transformer-based language models, there has been significant progress in biomedical entity linking. In this work, we focus on learning through synonym pairs associated with the entities. Compared to existing approaches, our approach significantly reduces the training data and resource consumption. Moreover, we propose a suite of context-based and context-less reranking techniques for performing entity disambiguation. Overall, we achieve similar performance to the state-of-the-art zero-shot and distantly supervised entity linking techniques on the MedMentions dataset, the largest annotated dataset on UMLS, without any domain-based training. Finally, we show that retrieval performance alone might not be sufficient as an evaluation metric and introduce an article-level quantitative and qualitative analysis to reveal further insights into the performance of entity linking methods.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses Biomedical Entity Linking (BEL) for clinical text standardization. Specifically, it focuses on the following aspects:

1. **Handling Polysemy and Synonymy**: Clinical texts contain many polysemous and synonymous terms, several of which may refer to the same core concept (i.e., clinical entity). Existing entity linking methods struggle with this diversity.
2. **Resource Consumption**: Existing entity linking methods typically require large amounts of training data and computational resources, which limits their use in low-resource environments.
3. **Entity Disambiguation**: During candidate generation, multiple entities may receive nearly identical similarity scores, making disambiguation difficult.
4. **Limitations of Evaluation Metrics**: Traditional evaluation metrics (such as retrieval performance) may not be sufficient to evaluate entity linking methods comprehensively, so additional evaluation dimensions are needed.

### Solutions

To address the above issues, the paper proposes the following methods:

1. **Learning Based on Synonym Pairs**: Training the model on synonym pairs from UMLS significantly reduces the required training data and resource consumption.
2. **Context-Independent and Context-Dependent Re-ranking Techniques**: A series of re-ranking techniques is proposed for disambiguation after candidate generation. These include parameterized re-ranking and re-ranking based on UMLS semantic information.
3. **Comprehensive Evaluation Method**: An evaluation method that combines article-level semantic similarity with strict matching and related matching is introduced to evaluate entity linking methods more thoroughly.

### Main Contributions

1. **Dataset and Model Training**: It is demonstrated that the pre-trained MiniLM model can achieve good performance with a small amount of training data, and that performance after fine-tuning on UMLS synonym pairs is actually inferior to the non-fine-tuned model.
2. **Entity Disambiguation**: It is shown that re-ranking methods based on semantic information provided by UMLS are very effective for entity disambiguation, and a parameterized re-ranking technique suitable for alias-based entity linking solutions is proposed.
3. **Evaluation Method**: A comprehensive entity linking evaluation method is proposed that uses the semantic representation of articles combined with strict matching and related matching. It reveals issues such as annotation granularity, context loss, and surface form bias.

### Experimental Results

- **Candidate Generation Performance**: On the MedMentions dataset, the proposed model achieved a retrieval performance of approximately 87% within the top 128 candidate entities.
- **Re-ranking Performance**: Re-ranking improved R@1 performance by more than 10%.
- **Comprehensive Evaluation**: Evaluating article-level semantic similarity alongside retrieval performance reveals the shortcomings of existing methods in terms of annotation granularity and context loss.

### Conclusion

The multi-stage method proposed in the paper performs well in biomedical entity linking, significantly improving precision while maintaining high recall. The method also has low costs for training, prototype space creation, and inference, which makes it suitable for resource-limited environments. Future research can further explore methods for handling polysemy and synonymy, and can introduce partial scoring mechanisms to evaluate prediction quality more accurately.
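
The candidate-generation step and the retrieval metric (R@k) discussed in this summary can be illustrated with a minimal sketch. This is not the paper's implementation: the entity IDs, the three-dimensional vectors, and the function names `generate_candidates` and `recall_at_k` are all hypothetical, and a real system would presumably obtain embeddings from a sentence encoder such as MiniLM rather than from hand-written vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def generate_candidates(
    mention_vec: Sequence[float],
    alias_index: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[str]:
    """Return up to k distinct entity IDs, ranked by their best-matching alias."""
    best: Dict[str, float] = {}
    for entity_id, alias_vec in alias_index:
        score = cosine(mention_vec, alias_vec)
        # An entity may have several aliases; keep only its highest score.
        if score > best.get(entity_id, float("-inf")):
            best[entity_id] = score
    ranked = sorted(best, key=lambda e: best[e], reverse=True)
    return ranked[:k]


def recall_at_k(predictions: List[List[str]], gold: List[str], k: int) -> float:
    """Fraction of mentions whose gold entity appears in the top-k candidates."""
    if not gold:
        return 0.0
    hits = sum(1 for cands, g in zip(predictions, gold) if g in cands[:k])
    return hits / len(gold)


# Toy alias index: (hypothetical entity ID, made-up alias embedding).
ALIAS_INDEX = [
    ("E1", [1.0, 0.0, 0.0]),
    ("E1", [0.9, 0.1, 0.0]),  # second alias (synonym) of the same entity
    ("E2", [0.0, 1.0, 0.0]),
    ("E3", [0.0, 0.0, 1.0]),
]

# Made-up mention embeddings and their gold entity IDs.
MENTIONS = [[0.8, 0.2, 0.0], [0.4, 0.5, 0.0], [0.1, 0.0, 0.9]]
GOLD = ["E1", "E1", "E3"]

PREDICTIONS = [generate_candidates(m, ALIAS_INDEX, k=2) for m in MENTIONS]
R_AT_1 = recall_at_k(PREDICTIONS, GOLD, k=1)
R_AT_2 = recall_at_k(PREDICTIONS, GOLD, k=2)

if __name__ == "__main__":
    print(f"R@1 = {R_AT_1:.3f}, R@2 = {R_AT_2:.3f}")
```

With these toy inputs, the second mention ranks E2 above its gold entity E1, so it counts toward R@2 but not R@1. This is the kind of near-tie among candidates that a re-ranking stage is meant to resolve.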