Abstract: Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm that uses weak supervision and deep representation to reduce this human effort. We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Because the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the paradigm's effectiveness on two institutional case studies at Mayo Clinic, smoking status classification and proximal femur (hip) fracture classification, and on one case study using a public dataset, the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models, namely Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN), using this paradigm. Precision, recall, and F1 score are used as metrics to evaluate performance. CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms.
We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks. By leveraging weak supervision and deep representation, the proposed clinical text classification paradigm could reduce the human effort of creating labeled training data and of feature engineering when applying machine learning to clinical text classification. The experiments validate the effectiveness of the paradigm on two institutional and one shared clinical text classification tasks.
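The paradigm described in the abstract (rule-generated labels, embedding features, a trained classifier, and evaluation with precision, recall, and F1) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the regular-expression rules, the hand-picked two-dimensional `TOY_EMBEDDINGS`, the example notes, and the nearest-centroid classifier, which stands in for the SVM, RF, MLPNN, and CNN models that the study actually tests. It is not the authors' implementation.

```python
import re

# Hypothetical keyword rules standing in for the rule-based NLP algorithm.
NEGATED = re.compile(r"\b(denies|never|no|non)\b[\s-]*(smok|tobacco)", re.I)
POSITIVE = re.compile(r"\b(smokes|smoker|smoking|tobacco)\b", re.I)

def weak_label(note):
    """Return 1 (smoker) or 0 (non-smoker) using the toy rules above."""
    if NEGATED.search(note):
        return 0
    return 1 if POSITIVE.search(note) else 0

# Hand-picked toy vectors standing in for pre-trained word embeddings.
TOY_EMBEDDINGS = {
    "smokes": (1.0, 0.1), "smoker": (0.9, 0.2),
    "pack": (0.8, 0.0), "daily": (0.7, 0.1),
    "denies": (0.0, 1.0), "never": (0.1, 0.9),
}

def embed(note):
    """Represent a note as the mean of the embeddings of its known words."""
    tokens = re.findall(r"[a-z]+", note.lower())
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0)
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dim))

def fit_centroids(features, labels):
    """Nearest-centroid training: average the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(s / counts[y] for s in acc) for y, acc in sums.items()}

def predict(centroids, x):
    """Assign the class whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def precision_recall_f1(gold, pred, positive=1):
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Step 1: label unlabeled notes with the rules (weak supervision).
unlabeled_notes = [
    "Patient smokes one pack daily.",
    "Current smoker, one pack daily.",
    "Patient denies smoking.",
    "Never smoked; denies tobacco use.",
]
weak_labels = [weak_label(n) for n in unlabeled_notes]

# Step 2: train on embedding features using the weak labels.
centroids = fit_centroids([embed(n) for n in unlabeled_notes], weak_labels)

# Step 3: evaluate on a small hand-labeled test set (made up for this sketch).
test_notes = ["One pack daily for ten years.", "Patient never smoked.",
              "Patient smokes daily."]
gold = [1, 0, 1]
model_pred = [predict(centroids, embed(n)) for n in test_notes]
rule_pred = [weak_label(n) for n in test_notes]

print("model:", precision_recall_f1(gold, model_pred))
print("rules:", precision_recall_f1(gold, rule_pred))
```

In this toy setup the first test note contains none of the rule keywords, so the rules miss it, while the trained model picks it up through the embeddings of "pack" and "daily". This only illustrates the mechanism by which a model trained on weak labels can generalize beyond the rules; the vectors were chosen by hand, so the numbers carry no evidential weight.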
