Abstract: Objective: Breast cancer is the most common malignant tumor among women. Diagnosis and treatment information for breast cancer patients is spread across multiple types of clinical fields, including clinicopathological data, genotype and phenotype information, treatment information, and prognosis information. However, current studies mainly focus on extracting information from one specific type of clinical field. This study defines a comprehensive information model to represent the whole-course clinical information of patients. Furthermore, deep learning approaches are used to extract concepts and their attributes from clinical breast cancer documents by fine-tuning pretrained Bidirectional Encoder Representations from Transformers (BERT) language models.
Materials and methods: The clinical corpus used in this study came from one 3A cancer hospital in China and consists of the encounter notes, operation records, pathology notes, radiology notes, progress notes, and discharge summaries of 100 breast cancer patients. Our system consists of two components: a named entity recognition (NER) component and a relation recognition component. For each component, we implemented deep learning-based approaches by fine-tuning BERT, which has outperformed other state-of-the-art methods on multiple natural language processing (NLP) tasks. A clinical language model was first pretrained using BERT on a large-scale unlabeled corpus of Chinese clinical text. For NER, the contextual embeddings pretrained using BERT were used as the input features of a Bi-LSTM-CRF (bidirectional long short-term memory-conditional random field) model and were fine-tuned using the annotated breast cancer notes. Furthermore, we proposed an approach to fine-tune BERT for relation extraction, which was treated as a classification problem in which the two entities mentioned in the input sentence were replaced with their semantic types.
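The relation extraction input described above, in which the two entity mentions are replaced with their semantic types, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Entity` class, the `mask_entity_pair` helper, the bracketed placeholder format, and the type names `SIZE` and `TUMOR` are all hypothetical, and an English sentence stands in for the Chinese clinical text used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A hypothetical entity mention given by character offsets."""
    start: int          # inclusive character offset
    end: int            # exclusive character offset
    semantic_type: str  # assumed label name, e.g. "TUMOR"


def mask_entity_pair(sentence: str, first: Entity, second: Entity) -> str:
    """Replace two non-overlapping mentions with semantic-type placeholders.

    The bracketed placeholder format is an assumption made for illustration.
    """
    # Order the pair by position so the result does not depend on argument order.
    left, right = sorted((first, second), key=lambda e: e.start)
    if left.end > right.start:
        raise ValueError("entity spans must not overlap")
    return (
        sentence[:left.start]
        + f"[{left.semantic_type}]"
        + sentence[left.end:right.start]
        + f"[{right.semantic_type}]"
        + sentence[right.end:]
    )


sentence = "A 2.5 cm mass was found in the left breast."
size = Entity(2, 8, "SIZE")    # covers "2.5 cm"
mass = Entity(9, 13, "TUMOR")  # covers "mass"
masked = mask_entity_pair(sentence, size, mass)
print(masked)
```

The masked sentence would then be passed to a sequence classifier that predicts whether, and how, the two typed placeholders are related. Replacing surface forms with types is one plausible way to let the classifier generalize across different mentions of the same kind of entity.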
Results: Our best-performing system achieved F1 scores of 93.53% for NER and 96.73% for relation extraction. Additional evaluations showed that the deep learning-based approaches that fine-tuned BERT outperformed the traditional Bi-LSTM-CRF and CRF machine learning algorithms in NER and the attention-Bi-LSTM and SVM (support vector machine) algorithms in relation recognition.
Conclusion: In this study, we developed a deep learning approach that fine-tuned BERT to extract breast cancer concepts and their attributes. The approach outperformed traditional machine learning algorithms, supporting its use in broader NER and relation extraction tasks in the medical domain.
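For readers unfamiliar with the F1 metric reported above, the sketch below shows how precision, recall, and F1 might be computed for NER under exact-match scoring. The abstract does not state the exact scoring protocol, so exact span-and-type matching is an assumption here, and the `prf1` helper and the toy annotations are hypothetical.

```python
def prf1(gold: set, predicted: set) -> tuple:
    """Return (precision, recall, F1) for two sets of (start, end, type) tuples.

    A prediction counts as correct only if span and type both match exactly
    (an assumed scoring rule, not one confirmed by the source).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations: two of three predictions match the gold standard exactly.
gold = {(0, 2, "TUMOR"), (5, 7, "SIZE"), (9, 11, "DRUG")}
pred = {(0, 2, "TUMOR"), (5, 7, "SIZE"), (12, 14, "DRUG")}
precision, recall, f1 = prf1(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

F1 is the harmonic mean of precision and recall, so a system must do well on both to reach a high score.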

Extracting comprehensive clinical information for breast cancer using deep learning methods
