Abstract: The Unified Medical Language System (UMLS) Metathesaurus construction process relies mainly on lexical algorithms and manual expert curation to integrate over 200 biomedical vocabularies. A lexical-based learning model (LexLM) was developed to predict synonymy among Metathesaurus terms, and it largely outperforms a rule-based approach (RBA) that approximates the current construction process. However, the LexLM can be improved further because it uses only lexical information from the source vocabularies, while the RBA also takes advantage of contextual information. We investigate the role of multiple types of contextual information available to the UMLS editors, namely source synonymy (SS), source semantic group (SG), and source hierarchical relations (HR), for the UMLS vocabulary alignment (UVA) problem. In this paper, we develop multiple variants of context-enriched learning models (ConLMs) by adding these types of contextual information to the LexLM. We represent the context types in context-enriched knowledge graphs (ConKGs) with four variants: ConSS, ConSG, ConHR, and ConAll. We train the ConKG embeddings using seven KG embedding techniques. We create the ConLMs by concatenating the ConKG embedding vectors with the word embedding vectors from the LexLM. We evaluate the performance of the ConLMs using the UVA generalization test datasets, which contain hundreds of millions of pairs. Our extensive experiments show a significant performance improvement from the ConLMs over the LexLM, namely +5.0% in precision (93.75%), +0.69% in recall (93.23%), and +2.88% in F1 (93.49%) for the best ConLM. Our experiments also show that the ConAll variant, which includes all three context types, takes more time to train but does not always perform better than the variants with a single context type. Finally, our experiments show that the pairs of terms with high lexical similarity benefit most from adding contextual information, namely +6.56% in precision (94.97%), +2.13% in recall (93.23%), and +4.35% in F1 (94.09%) for the best ConLM. The pairs with lower degrees of lexical similarity also show performance improvement, with +0.85% in F1 (96%) for low similarity and +1.31% in F1 (96.34%) for no similarity. These results demonstrate the importance of using contextual information in the UVA problem.
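
As a rough illustration of two points in the abstract, the sketch below shows (a) how a context embedding might be concatenated with a lexical embedding to form a single feature vector, and (b) that the reported F1 values follow from the reported precision and recall as their harmonic mean. The function names, vector sizes, and toy values are assumptions made for illustration; they are not taken from the paper and do not reflect the authors' implementation.

```python
from typing import List, Sequence


def concat_features(lexical_vec: Sequence[float],
                    context_vec: Sequence[float]) -> List[float]:
    """Join a lexical embedding and a context (KG) embedding end to end.

    Hypothetical helper: the abstract only states that the two vectors
    are concatenated, not how the result is consumed downstream.
    """
    return list(lexical_vec) + list(context_vec)


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy vectors (made-up values and dimensions, for illustration only).
lexical = [0.1, 0.2, 0.3]
context = [0.5, -0.4]
combined = concat_features(lexical, context)

# Consistency check against the figures reported in the abstract.
best_overall_f1 = round(f1_score(93.75, 93.23), 2)
high_similarity_f1 = round(f1_score(94.97, 93.23), 2)

print(len(combined))        # length of the concatenated vector
print(best_overall_f1)      # compare with the reported 93.49
print(high_similarity_f1)   # compare with the reported 94.09
```

Plain concatenation keeps the lexical and contextual signals in separate dimensions, so a downstream classifier can weight them independently.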

Similar Word Model for Unfrequent Word Enhancement in Speech Recognition

Low-frequency word enhancement with similar pairs in speech recognition

Recognize Foreign Low-Frequency Words with Similar Pairs

Exploiting Future Word Contexts in Neural Network Language Models for Speech Recognition

Multi-View LSTM Language Model With Word-Synchronized Auxiliary Feature For LVCSR

Recurrent Neural Network Language Model With Structured Word Embeddings For Speech Recognition

Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Future word contexts in neural network language models

ERCNN: Enhanced Recurrent Convolutional Neural Networks For Learning Sentence Similarity

End-to-End Speech Recognition Contextualization with Large Language Models

Learning Effective Word Embedding Using Morphological Word Similarity

Modeling multi-prototype Chinese word representation learning for word similarity

Revisit Word Embeddings with Semantic Lexicons for Modeling Lexical Contrast

Context-Enriched Learning Models for Aligning Biomedical Vocabularies at Scale in the UMLS Metathesaurus

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

Empower Your Model with Longer and Better Context Comprehension

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

Joint-Character-POC N-Gram Language Modeling for Chinese Speech Recognition

A Word Language Model Based Contextual Language Processing On Chinese Character Recognition