Abstract: Clinical Trials, Ahead of Print.

Background/Aims: Performance status is crucial for most clinical research, as an eligibility criterion, a comorbidity covariate, or a trial endpoint. Yet information on performance status often is embedded as free text within a patient's electronic medical record, rather than coded directly, thereby making this concept extremely difficult to extract for research. Furthermore, performance status information frequently resides in outside reports, which are scanned into the electronic medical record along with thousands of clinic notes. The image format of scanned documents also is a major obstacle to the search and retrieval of information, as natural language processing cannot be applied to unstructured text within an image. We, therefore, utilized optical character recognition software to convert images to a searchable format, allowing the application of natural language processing to identify pertinent performance status data elements within scanned electronic medical records.

Methods: Our study cohort consisted of 189 subjects diagnosed with diffuse large B-cell lymphoma for whom performance status was a required data element for analysis of prognostic factors related to recurrence and survival. Manual abstraction of performance status was previously conducted by a clinical Subject Matter Expert, serving as the gold standard. Leveraging our data warehouse, we extracted relevant scanned electronic medical record documents and applied optical character recognition to these images using the ABBYY FineReader software. The Linguamatics i2e natural language processing software was then used to run queries for performance status against the corpus of electronic medical record documents. We evaluated our optical character recognition/natural language processing pipeline for accuracy and reduction in data extraction effort.

Results: We found that there was high accuracy and reduced time for extraction of performance status data by applying our optical character recognition/natural language processing pipeline. The transformed scanned documents from a random sample of patients yielded excellent precision, recall, and F score, with <1% incorrect results. Time savings from a second cohort showed that median time to review documents for patients with performance status data present was reduced by a third. The major time savings was in the review of those documents that in fact did not contain performance status information: median of 18 minutes versus 108 minutes for manual review, an 83% reduction in data abstraction effort.

Conclusion: By applying this optical character recognition/natural language processing pipeline, we achieved significant operational improvement and reduced time for information retrieval to support clinical research. Our study demonstrated that optical character recognition software provides an effective mechanism to transform scanned electronic medical record images to allow the application of natural language processing, yielding highly accurate data abstraction. We conclude that our optical character recognition/natural language processing pipeline can greatly facilitate research data abstraction by providing a highly focused data review, eliminating unnecessary manual review of the entire chart, and thus freeing time for abstracting other data elements requiring more human interpretation.
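To make the evaluation described in the abstract concrete, the following is a minimal Python sketch of two pieces: querying text that has already been through optical character recognition for performance status mentions, and computing the reported kinds of metrics. It is not the authors' implementation. The study used ABBYY FineReader and Linguamatics i2e; here a plain regular-expression search stands in for the natural language processing query. The patterns, the function names, and the counts passed to `precision_recall_f1` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for two common performance status scales.
# Real clinical notes vary far more than these patterns cover.
ECOG_RE = re.compile(
    r"\b(?:ECOG|Zubrod)(?:\s+performance\s+status|\s+PS)?\s*(?:score|grade)?"
    r"\s*(?:of|is|was|[:=])?\s*([0-5])\b",
    re.IGNORECASE,
)
KPS_RE = re.compile(
    r"\b(?:Karnofsky|KPS)(?:\s+performance\s+status)?\s*(?:score)?"
    r"\s*(?:of|is|was|[:=])?\s*(\d{2,3})\s*%?",
    re.IGNORECASE,
)


def find_performance_status(text):
    """Return (scale, value, matched_text) tuples found in OCR output text."""
    hits = []
    for m in ECOG_RE.finditer(text):
        hits.append(("ECOG", int(m.group(1)), m.group(0)))
    for m in KPS_RE.finditer(text):
        value = int(m.group(1))
        # Karnofsky scores run from 0 to 100 in steps of 10.
        if 0 <= value <= 100 and value % 10 == 0:
            hits.append(("KPS", value, m.group(0)))
    return hits


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from counts against a gold standard."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def percent_reduction(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    sample = "Pt seen in clinic. ECOG performance status: 1. Prior Karnofsky score 80%."
    for scale, value, matched in find_performance_status(sample):
        print(scale, value, repr(matched))

    # Illustrative counts only; the abstract does not report raw counts.
    print(precision_recall_f1(tp=8, fp=2, fn=2))

    # Median review times from the abstract: 108 minutes manual vs 18 minutes.
    print(round(percent_reduction(108, 18)))
```

The last call reproduces the arithmetic behind the reported reduction: (108 - 18) / 108 is about 0.833, which rounds to the 83% stated in the Results.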

Using Natural Language Processing to Identify Different Lens Pathology in Electronic Health Records

Using natural language processing to link patients' narratives to visual capabilities and sentiments

Ophthalmology Operation Note Encoding with Open-Source Machine Learning and Natural Language Processing

Natural Language Processing to Identify Abnormal Breast, Lung, and Cervical Cancer Screening Test Results from Unstructured Reports to Support Timely Follow-up

Comparison of Diagnosis Codes to Clinical Notes in Classifying Patients with Diabetic Retinopathy

Improving the Identification of Diabetic Retinopathy and Related Conditions in the Electronic Health Record Using Natural Language Processing Methods

Facilitating clinical research through automation: Combining optical character recognition with natural language processing

Natural language processing to identify lupus nephritis phenotype in electronic health records

Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports

Large-scale Identification of Patients with Cerebral Aneurysms Using Natural Language Processing

Incorporating Natural Language Processing to Improve Classification of Axial Spondyloarthritis Using Electronic Health Records

Natural Language Processing in medicine and ophthalmology: A review for the 21st-century clinician

Using natural language processing to identify opioid use disorder in electronic health record data

An accessible, efficient, and accurate natural language processing method for extracting diagnostic data from pathology reports

Identifying Diabetes Related-Complications in a Real-World Free-Text Electronic Medical Records in Hebrew Using Natural Language Processing Techniques

Natural language processing improves identification of colorectal cancer testing in the electronic medical record

A case study in applying artificial intelligence-based named entity recognition to develop an automated ophthalmic disease registry

Machine Learning Methods Using Artificial Intelligence Deployed on Electronic Health Record Data for Identification and Referral of At-Risk Patients From Primary Care Physicians to Eye Care Specialists: Retrospective, Case-Controlled Study

Looking for low vision: Predicting visual prognosis by fusing structured and free-text data from electronic health records

Predicting near-term glaucoma progression: An artificial intelligence approach using clinical free-text notes and data from electronic health records

Natural Language Processing Versus Diagnosis Code-Based Methods for Postherpetic Neuralgia Identification: Algorithm Development and Validation