Abstract:

Background: The medical subdomain of a clinical note, such as cardiology or neurology, is useful content-derived metadata for developing downstream machine learning applications. To classify the medical subdomain of a note accurately, we constructed a machine learning-based natural language processing (NLP) pipeline and developed medical subdomain classifiers based on the content of the note.

Methods: We constructed the pipeline using the clinical Text Analysis and Knowledge Extraction System (cTAKES), a clinical NLP system; the Unified Medical Language System (UMLS) Metathesaurus and Semantic Network; and learning algorithms to extract features from two datasets: clinical notes from the Integrating Data for Analysis, Anonymization, and Sharing (iDASH) data repository (n = 431) and from Massachusetts General Hospital (MGH) (n = 91,237). We then built medical subdomain classifiers with different combinations of data representation methods and supervised learning algorithms, and evaluated their performance and their portability across the two datasets.

Results: The medical subdomain classifier trained with a convolutional recurrent neural network on neural word embeddings yielded the best performance on the iDASH and MGH datasets, with area under the receiver operating characteristic curve (AUC) of 0.975 and 0.991 and F1 scores of 0.845 and 0.870, respectively. Considering better clinical interpretability, the linear support vector machine classifier using a hybrid of bag-of-words and clinically relevant UMLS concepts as the feature representation, with term frequency-inverse document frequency (tf-idf) weighting, outperformed the other shallow learning classifiers on the iDASH and MGH datasets, with AUC of 0.957 and 0.964 and F1 scores of 0.932 and 0.934, respectively. For half of the medical subdomains we studied, classifiers trained on one dataset and applied to the other reached the threshold F1 score of 0.7.

Conclusion: Our study shows that a supervised learning-based NLP approach is useful for developing medical subdomain classifiers. The deep learning algorithm with distributed word representation yields better performance, yet shallow learning algorithms with the word and concept representation achieve comparable performance with better clinical interpretability. Portable classifiers may also be used across datasets from different institutions.
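
The portability check reported in the results (a per-subdomain F1 score compared against a threshold of 0.7) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `per_class_f1` and `portable_subdomains`, the toy labels, and the assumption that each note carries exactly one subdomain label scored one-vs-rest are all hypothetical choices made for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, scoring each label in y_true one-vs-rest.

    Assumes one subdomain label per note (an illustrative simplification).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def portable_subdomains(scores, threshold=0.7):
    """Return the labels whose F1 meets the threshold (0.7 in the abstract)."""
    return [label for label, f1 in scores.items() if f1 >= threshold]


# Toy cross-dataset predictions: imagine a classifier trained on one
# dataset and applied to notes from the other (labels are made up).
y_true = ["cardiology"] * 3 + ["neurology"] * 3
y_pred = ["cardiology"] * 3 + ["cardiology", "cardiology", "neurology"]

scores = per_class_f1(y_true, y_pred)
print(scores)                       # {'cardiology': 0.75, 'neurology': 0.5}
print(portable_subdomains(scores))  # ['cardiology']
```

In this toy case one of the two subdomains clears the threshold, which mirrors the shape of the reported finding without reproducing any of its data.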

Multilabel classification of medical concepts for patient clinical profile identification

Impact of translation on biomedical information extraction from real-life clinical notes

Learning structures of the French clinical language: development and validation of word embedding models using 21 million clinical reports from electronic health records

Seeing The Whole Patient: Using Multi-Label Medical Text Classification Techniques to Enhance Predictions of Medical Codes

Extracting Family History of Patients from Clinical Narratives: Exploring an End-to-End Solution with Deep Learning Models.

Collaborative and privacy-enhancing workflows on a clinical data warehouse: an example developing natural language processing pipelines to detect medical conditions

Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach

De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks

Accelerating Clinical Text Annotation in Underrepresented Languages: A Case Study on Text De-Identification

Efficient labeling of French mammogram reports with MammoBERT

Multi-label Classification for Clinical Text with Feature-level Attention

Applying a Deep Learning-Based Sequence Labeling Approach to Detect Attributes of Medical Concepts in Clinical Text

Efficient Clinical Information Extraction from Breast Radiology Reports in French

A twofold strategy for translating a medical terminology into French

Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes

Recognition and normalization of multilingual symptom entities using in-domain-adapted BERT models and classification layers

Large Language Models for Patient Comments Multi-Label Classification

Transformers for Multi-label Classification of Medical Text: An Empirical Comparison

Domain-specific long text classification from sparse relevant information

Extracting UMLS Concepts from Medical Text Using General and Domain-Specific Deep Learning Models

Automated Drug-Related Information Extraction from French Clinical Documents: ReLyfe Approach