Abstract: Clinical Trials, Ahead of Print.

Background/Aims: Performance status is crucial for most clinical research, as an eligibility criterion, a comorbidity covariate, or a trial endpoint. Yet information on performance status is often embedded as free text within a patient's electronic medical record, rather than coded directly, making this concept extremely difficult to extract for research. Furthermore, performance status information frequently resides in outside reports, which are scanned into the electronic medical record along with thousands of clinic notes. The image format of scanned documents is also a major obstacle to the search and retrieval of information, as natural language processing cannot be applied to unstructured text within an image. We therefore used optical character recognition software to convert images to a searchable format, allowing the application of natural language processing to identify pertinent performance status data elements within scanned electronic medical records.

Methods: Our study cohort consisted of 189 subjects diagnosed with diffuse large B-cell lymphoma for whom performance status was a required data element for analysis of prognostic factors related to recurrence and survival. Manual abstraction of performance status had previously been conducted by a clinical subject matter expert and served as the gold standard. Leveraging our data warehouse, we extracted relevant scanned electronic medical record documents and applied optical character recognition to these images using the ABBYY FineReader software. The Linguamatics i2e natural language processing software was then used to run queries for performance status against the corpus of electronic medical record documents. We evaluated our optical character recognition/natural language processing pipeline for accuracy and reduction in data extraction effort.

Results: Applying our optical character recognition/natural language processing pipeline yielded high accuracy and reduced the time needed to extract performance status data. The transformed scanned documents from a random sample of patients yielded excellent precision, recall, and F score, with <1% incorrect results. In a second cohort, the median time to review documents for patients with performance status data present was reduced by a third. The largest time savings came from the review of documents that did not in fact contain performance status information: a median of 18 minutes versus 108 minutes for manual review, an 83% reduction in data abstraction effort.

Conclusion: By applying this optical character recognition/natural language processing pipeline, we achieved significant operational improvement and reduced time for information retrieval to support clinical research. Our study demonstrated that optical character recognition software provides an effective mechanism to transform scanned electronic medical record images to allow the application of natural language processing, yielding highly accurate data abstraction. We conclude that our optical character recognition/natural language processing pipeline can greatly facilitate research data abstraction by providing a highly focused data review, eliminating unnecessary manual review of the entire chart, and thus freeing time for abstracting other data elements requiring more human interpretation.
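As a rough illustration of the query and evaluation steps described above, the following minimal Python sketch searches text that has already been through optical character recognition for performance status mentions, and computes the evaluation quantities named in the abstract. It is not the ABBYY FineReader or Linguamatics i2e workflow: a simple regular expression stands in for the natural language processing query, and the pattern, the function names, and the assumption that performance status appears as an ECOG or Karnofsky value are all hypothetical choices made for this example.

```python
import re

# Hypothetical pattern (an assumption for illustration, not the study's query):
# matches mentions such as "ECOG performance status: 1", "ECOG PS of 2", "KPS 80".
PS_PATTERN = re.compile(
    r"\b(ECOG|KPS|Karnofsky)"               # scale name
    r"(?:\s+(?:performance\s+status|PS))?"  # optional "performance status" / "PS"
    r"\s*(?:score|of|is|[:=])?\s*"          # optional connector word or symbol
    r"(\d{1,3})\b",                         # numeric value
    re.IGNORECASE,
)


def extract_performance_status(ocr_text):
    """Return (scale, value) pairs found in text produced by an OCR step."""
    return [
        (m.group(1).upper(), int(m.group(2)))
        for m in PS_PATTERN.finditer(ocr_text)
    ]


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F score from counts against a gold standard."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def percent_reduction(before, after):
    """Relative reduction, in percent, from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Made-up sample sentence, not taken from any patient record.
    print(extract_performance_status("Patient seen today. ECOG PS of 2."))
    # Median review times reported in the Results: 108 minutes versus 18 minutes.
    print(round(percent_reduction(108, 18)))
```

With the reported medians, the last line reproduces the 83% reduction stated in the Results. A pattern this simple would miss many real-world phrasings, which is one reason a dedicated natural language processing tool was used in the study.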

Automated data extraction of bar chart raster images

Automatic Analysis of Microaneurysms Turnover to Diagnose the Progression of Diabetic Retinopathy.

Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system

ChartParser: Automatic Chart Parsing for Print-Impaired

Facilitating clinical research through automation: Combining optical character recognition with natural language processing

Extraction of Text from Optic Nerve Optical Coherence Tomography Reports

Automatic Identification and Data Extraction from 2-Dimensional Plots in Digital Documents

Automatic retinal image diagnosis system for mass health screenings

Improved Recognition of Figures Containing Fluorescence Microscope Images in Online Journal Articles Using Graphical Models.

Automated Data Transformation and Feature Extraction for Oxygenation-Sensitive Cardiovascular Magnetic Resonance Images

Automated extraction of retinal vasculature

Image processing based data extraction from graphical representation

Expert-level Automated Biomarker Identification in Optical Coherence Tomography Scans

An automatic system for extracting figure-caption pair from medical documents: a six-fold approach

Improving tabular data extraction in scanned laboratory reports using deep learning models

Validity and Reliability Analysis of the PlotDigitizer Software Program for Data Extraction from Single-Case Graphs

Analysis of hybrid statistical textural and intensity features to discriminate retinal abnormalities through classifiers

Clinical Features Distinguishing Diabetic Retinopathy Severity Using Artificial Intelligence

RV-ESA: A novel computer-aided elastic shape analysis system for retinal vessels in diabetic retinopathy

Exporting Diabetic Retinopathy Images from VA VistA Imaging for Research

Clinical Features for Detecting Diabetic Macular Edema using Artificial Intelligence